Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin
Date: Thu Dec 01 2016 - 08:35:01 EST




On 11/30/2016 09:46 AM, Michael S. Tsirkin wrote:
> On Mon, Nov 28, 2016 at 05:32:15PM +0800, Cao jin wrote:
>>
>>

>>>
>>>> + if (severity == AER_FATAL && strcmp(dev->driver->name, "vfio-pci")) {
>>>
>>> You really want some flag in the device, or something similar.
>>> Also, how do we know driver is not going away at this point?
>>>
>>
>> I didn't think of this condition, and I don't quite follow how would driver
>> go away?(device has error happened, then is removed?)
>
> Yes - hotplug request detected. Does something prevent this?
>

I wasn't realized there would be possible to have hotplug happened
during error recovery. But, it seems is the aer driver's thing to
consider it?

>>>> result = reset_link(dev);
>>>> if (result != PCI_ERS_RESULT_RECOVERED)
>>>> goto failed;
>>
>>>> @@ -1187,10 +1200,30 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>>>> return PCI_ERS_RESULT_DISCONNECT;
>>>> }
>>>>
>>>> + /* get device's uncorrectable error status as soon as possible,
>>>> + * and signal it to user space. The later we read it, the possibility
>>>> + * the register value is mangled grows. */
>>>> + aer_cap_offset = pci_find_ext_capability(vdev->pdev, PCI_EXT_CAP_ID_ERR);
>>>> + ret = pci_read_config_dword(vdev->pdev, aer_cap_offset +
>>>> + PCI_ERR_UNCOR_STATUS, &uncor_status);
>>>> + if (ret)
>>>> + return PCI_ERS_RESULT_DISCONNECT;
>>>> +
>>>> + pr_err("device %d got AER detect notification. uncorrectable error status = 0x%x\n", pdev->devfn, uncor_status);//to be removed
>>>> mutex_lock(&vdev->igate);
>>>> +
>>>> + vdev->aer_recovering = true;
>>>> + reinit_completion(&vdev->aer_error_completion);
>>>> +
>>>> + /* suspend config space access from user space,
>>>> + * when vfio-pci's error recovery process is on */
>>>
>>> what about access to memory etc? Do you need to suspend this as well?
>>>
>>
>> Yes, this question came into my mind a little bit, but I didn't see some
>> existing APIs like pci_cfg_access_xxx which can help to do this.(I am still
>> not familiar with kernel)
>
> This isn't easy to do at all.
>

I guess we can use completion in vfio_pci_rw to block all kinds of
access during vfio's error recovery. But from another perspective, now
that we have disabled reset_link in host, I think the access to device
could be performed normally. To me, the benefit of using pci_cfg_access
is: we won't bother to do "wait 3s to do guest's link reset"

>
>>>> + pci_cfg_access_trylock(vdev->pdev);
>>>
>>> If you trylock, you need to handle failure.
>>
>> try lock returns 0 if access is already locked, 1 otherwise. Is it necessary
>> to check its return value?
>
> Locked by whom? You blissfully access as if it's locked by you.
>
>>
>> --
>> Sincerely,
>> Cao jin
>>
>
>
> .
>

--
Sincerely,
Cao jin