Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin
Date: Fri Dec 16 2016 - 05:11:09 EST




On 12/15/2016 10:50 PM, Michael S. Tsirkin wrote:
> On Thu, Dec 15, 2016 at 09:56:41PM +0800, Cao jin wrote:
>>
>>
>> On 12/15/2016 06:16 AM, Alex Williamson wrote:
>>> On Wed, 14 Dec 2016 18:24:23 +0800
>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>
>>>> Sorry for late.
>>>> after reading all your comments, I think I will try the solution 1.
>>>>
>>>> On 12/13/2016 03:12 AM, Alex Williamson wrote:
>>>>> On Mon, 12 Dec 2016 21:49:01 +0800
>>>>> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I have 2 solutions(high level design) came to me, please see if they are
>>>>>> acceptable, or which one is acceptable. Also have some questions.
>>>>>>
>>>>>> 1. block guest access during host recovery
>>>>>>
>>>>>> add new field error_recovering in struct vfio_pci_device to
>>>>>> indicate host recovery status. aer driver in host will still do
>>>>>> reset link
>>>>>>
>>>>>> - set error_recovering in vfio-pci driver's error_detected, used to
>>>>>> block all kinds of user access(config space, mmio)
>>>>>> - in order to solve concurrent issue of device resetting & user
>>>>>> access, check device state[*] in vfio-pci driver's resume, see if
>>>>>> device reset is done, if it is, then clear"error_recovering", or
>>>>>> else new a timer, check device state periodically until device
>>>>>> reset is done. (what if device reset don't end for a long time?)
>>>>>> - In qemu, translate guest link reset to host link reset.
>>>>>> A question here: we already have link reset in host, is a second
>>>>>> link reset necessary? why?
>>>>>>
>>>>>> [*] how to check device state: reading certain config space
>>>>>> register, check return value is valid or not(All F's)
>>>>>
>>>>> Isn't this exactly the path we were on previously?
>>>>
>>>> Yes, it is basically the previous path, plus the optimization.
>>>>
>>>>> There might be an
>>>>> optimization that we could skip back-to-back resets, but how can you
>>>>> necessarily infer that the resets are for the same thing? If the user
>>>>> accesses the device between resets, can you still guarantee the guest
>>>>> directed reset is unnecessary? If time passes between resets, do you
>>>>> know they're for the same event? How much time can pass between the
>>>>> host and guest reset to know they're for the same event? In the
>>>>> process of error handling, which is more important, speed or
>>>>> correctness?
>>>>>
>>>>
>>>> I think vfio driver itself won't know what each reset comes for, and I
>>>> don't quite understand why should vfio care this question, is this a new
>>>> question in the design?
>>>
>>> You're suggesting an optimization to eliminate one of the resets,
>>> and as we've discussed, I don't see removing the host induced reset
>>> as a viable option. That means you want to eliminate the guest
>>> directed reset. There are potentially three levels to do that, the
>>> vfio-pci driver in the host kernel, somewhere in QEMU, or eliminate it
>>> within the guest. My comments were directed to the first option, the
>>> host kernel level cannot correlate user directed resets as duplicates
>>> of host directed resets.
>>>
>>
>> Ah, maybe it is mistake, I don't really want to eliminate guest directed
>> reset very much, I was just not sure why it is very necessary.
>>
>> The optimization I said just is fully separating host recovery from
>> guest recovery(timer, check device periodically) in time, because there
>> is concurrent device resetting & user access.
>>
>>>> But I think it make sense that the user access during 2 resets maybe a
>>>> trouble for guest recovery, misbehaved user could be out of our
>>>> imagination. Correctness is more important.
>>>>
>>>> If I understand you right, let me make a summary: host recovery just
>>>> does link reset, which is incomplete, so we'd better do a complete guest
>>>> recovery for correctness.
>>>
>>> We don't know whether the host link reset is incomplete, but we can't do
>>> a link reset transparently to the device, the device is no longer in the
>>> same state after the reset. The device specific driver, which exists
>>> in userspace needs to be involved in device recovery. Therefore
>>> regardless of how QEMU handles the error, the driver within the guest
>>> needs to be notified and perform recovery. Since the device is PCI and
>>> we're on x86 and nobody wants to introduce paravirtual error recovery,
>>> we must use AER. Part of AER recovery includes the possibility of
>>> performing a link reset. So it seems this eliminates avoiding the link
>>> reset within the guest.
>>>
>>> That leaves QEMU. Here we need to decide whether a guest triggered
>>> link reset induces a host link reset. The current working theory is
>>> that yes, this must be the case. If there is ever a case where a
>>> driver within the guest could trigger a link reset for the purposes
>>> of error recovery when the host has not, I think this must be the
>>> case. Therefore, at least some guest induced link resets must become
>>> host link resets. Currently we assume all guest induced link resets
>>> become host link resets. Minimally to avoid that, QEMU would need to
>>> know (not assume) whether the host performed a link reset. Even with
>>> that, QEMU would need to be able to correlate that a link reset from
>>> the guest is a duplicate of a link reset that was already performed by
>>> the host. That implies that QEMU needs to deduce the intention of
>>> the guest. That seems like a complicated task for a patch series that
>>> is already complicated enough, especially for a feature of questionable
>>> value given the configuration restrictions (imo).
>>>
>>> I would much rather focus on getting it right and making it as simple
>>> as we can, even if that means links get reset one too many times on
>>> error.
>>>
>>
>> Thanks very much for your detailed explanation, it does helps me to
>> understand your concern, understand why a second link reset is necessary.
>>
>> I still want to share my thoughts with you(not argue): now we know host
>> aer driver will do link reset for vfio-pci first, so I can say, even if
>> fatal error is link related, after host link reset, link can work now.
>> Then in qemu, we are not necessary to translate guest link reset to host
>> link reset, just use vfio_pci_reset() as it is to do device
>> reset(probably is FLR). Which also means we don't need following
>> patch(make code easier):
>>
>> @@ -3120,6 +3122,18 @@ static void vfio_pci_reset(DeviceState *dev)
>>
>> trace_vfio_pci_reset(vdev->vbasedev.name);
>>
>> + if (vdev->features & VFIO_FEATURE_ENABLE_AER) {
>> + PCIDevice *br = pci_bridge_get_device(pdev->bus);
>> +
>> + if ((pci_get_word(br->config + PCI_BRIDGE_CONTROL) &
>> + PCI_BRIDGE_CTL_BUS_RESET)) {
>> + if (pci_get_function_0(pdev) == pdev) {
>> + vfio_pci_hot_reset(vdev, vdev->single_depend_dev);
>> + }
>> + return;
>> + }
>> + }
>> +
>> vfio_pci_pre_reset(vdev);
>>
>>
>> I think this also implies: we have a virtual link in qemu, but a virtual
>> link will never be broken like a physical link.(In particular we already
>> know host aer driver surely will do link reset to recover physical
>> link). So, guest's link reset don't need to care whether virtual link is
>> reset, just care virtual device. And qemu "translates guest link reset
>> to host link reset" seems kind of taking link-reset responsibility over
>> from host:)
>>
>>>>>> 2. skip link reset in aer driver of host kernel, for vfio-pci.
>>>>>> Let user decide how to do serious recovery
>>>>>>
>>>>>> add new field "user_driver" in struct pci_dev, used to skip link
>>>>>> reset for vfio-pci; add new field "link_reset" in struct
>>>>>> vfio_pci_device to indicate link has been reset or not during
>>>>>> recovery
>>>>>>
>>>>>> - set user_driver in vfio_pci_probe(), to skip link reset for
>>>>>> vfio-pci in host.
>>>>>> - (use a flag)block user access(config, mmio) during host recovery
>>>>>> (not sure if this step is necessary)
>>>>>> - In qemu, translate guest link reset to host link reset.
>>>>>> - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
>>>>>> is executed
>>>>>> - In vfio-pci driver's resume, new a timer, check "link_reset" field
>>>>>> periodically, if it is set in reasonable time, then clear it and
>>>>>> delete timer, or else, vfio-pci driver will does the link reset!
>>>>>
>>>>> What happens in the case of a multifunction device where each function
>>>>> is part of a separate IOMMU group and one function is hot-removed from
>>>>> the user? We can't do a link reset on that function since the other
>>>>> function is still in use. We have no choice but release a device in an
>>>>> unknown state back to the host.
>>>>
>>>> hot-remove from user, do you mean, for example, all functions assigned
>>>> to VM, then suddenly a person does something like following
>>>>
>>>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind
>>>>
>>>> $ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind
>>>>
>>>> to return device to host driver, or don't bind it to host driver, let it
>>>> in driver-less state???
>>>
>>> Yes, the host kernel has no visiblity to how a user is making use of
>>> devices. To support AER we require a similar topology between host and
>>> guest such that a guest link reset translates to a host reset. That
>>> requirement is imposed by userspace, ie. QEMU. The host kernel cannot
>>> presume that this is the case. Therefore we could have a
>>> multi-function device where each function is assigned to the same or
>>> different users in any configuration. If a fault occurs and we defer
>>> to the user to perform the link reset, we have absolutely no guarantee
>>> that it will ever occur. If the functions are assigned to different
>>> users, then each user individually doesn't have the capability to
>>> perform a link reset. If the devices happen to be assigned to a single
>>> user when the error occurs, we cannot assume the user has an AER
>>> compatible configuration, the devices could be exposed as separate
>>> single function devices, any one of which might be individually removed
>>> from the user and made use of by the host, such as your sysfs example
>>> above. The host cannot perform a link reset in this case either
>>> as the sibling devices are still in use by the guest. Thanks,
>>>
>>> Alex
>>>
>>>
>>
>> this explanation is valuable to me, so this is also why we can't do link
>> reset in vfio driver when one of the function is closed. And do link
>> reset in vfio driver until all functions are close is poor solution and
>> very complex(quarantine the device) as you said.
>>
>> I am going to try solution 1, but I still have some consideration share
>> with you, this won't stop my trial, and don't have relationship with
>> above discussion, just FYI:
>>
>> In non-virtuallization environment, from device's perspective, the steps
>> of a normal recovery consists of:
>> error_detect
>> mmio_enabled
>> link_reset
>> slot_reset
>> resume
>>
>> Now in our condition, the steps become:
>> *link_reset* (host's, the following are guest's)
>> error_detect
>> mmio_enabled
>> link_reset
>> slot_reset
>> resume
>>
>> Especially, some device's specific driver in guest could do some
>> specific work in error_detect, take igb_io_error_detected() for example.
>> Like the words in pci-error-recovery.txt said:
>>
>> it gives the driver a chance to cleanup, waiting for pending stuff
>> (timers, whatever, etc...) to complete;
>>
>> But if link_reset is the first step, we lost all the status(register
>> value, etc) in the device. Of course I don't know if this will be a
>> problem (might not), just curious if this has been your concern:)
>
> You'll find I did mention it :)
>
> But consider Documentation/PCI/pcieaer-howto.txt
>
> 3.2.2.2 Non-correctable (non-fatal and fatal) errors
>
> If an error message indicates a non-fatal error, performing link reset
> at upstream is not required. The AER driver calls error_detected(dev,
> pci_channel_io_normal) to all drivers associated within a hierarchy in
> question. for example,
> EndPoint<==>DownstreamPort B<==>UpstreamPort A<==>RootPort.
> If Upstream port A captures an AER error, the hierarchy consists of
> Downstream port B and EndPoint.
>
> A driver may return PCI_ERS_RESULT_CAN_RECOVER,
> PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
> whether it can recover or the AER driver calls mmio_enabled as next.
>
> If an error message indicates a fatal error, kernel will broadcast
> error_detected(dev, pci_channel_io_frozen) to all drivers within
> a hierarchy in question. Then, performing link reset at upstream is
> necessary.
>
> I think that if you just forward errors to guests they will get confused.
> I see three possible approaches.
>
>
> 1. Always pretend to guest that there was a fatal error,
> then basically:
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dce511f..4022f9b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1299,7 +1299,7 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>
> vfio_device_put(device);
>
> - return PCI_ERS_RESULT_CAN_RECOVER;
> + return PCI_ERS_RESULT_DISCONNECT;
> }
>
> static const struct pci_error_handlers vfio_err_handlers = {
>
>
> probably conditional on userspace invoking some ioctl
> to avoid breaking existing users.
>
> 2. send non fatal error to guest.
> Add another eventfd to distinguish non fatal and fatal errors.
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dce511f..e22f449 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1292,14 +1292,17 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>
> mutex_lock(&vdev->igate);
>
> - if (vdev->err_trigger)
> + if (state == pci_channel_io_normal && vdev->recover_trigger)
> + eventfd_signal(vdev->recover_trigger, 1);
> + else if (vdev->err_trigger)
> eventfd_signal(vdev->err_trigger, 1);
>
> mutex_unlock(&vdev->igate);
>
> vfio_device_put(device);
>
> return PCI_ERS_RESULT_CAN_RECOVER;
> }
>
> static const struct pci_error_handlers vfio_err_handlers = {
>
> Forward non fatal ones to guest, stop vm on fatal ones.
>
>
>
> 3. forward both non fatal and fatal error to guest
> This includes 1 and 2 above, and
>
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index dce511f..4022f9b 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -1299,7 +1299,8 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
>
> vfio_device_put(device);
>
> - return PCI_ERS_RESULT_CAN_RECOVER;
> + return state == pci_channel_io_normal : PCI_ERS_RESULT_CAN_RECOVER :
> + PCI_ERS_RESULT_DISCONNECT;
> }
>
> static const struct pci_error_handlers vfio_err_handlers = {
>
> Maybe make this conditional on recover_trigger to keep
> compatibility.
>
>
> You seem to be starting from 1. But how about starting small, and doing
> 2 as a first step? Fatal errors will still stop vm.
> This will help you merge a bunch of error reporting infrastructure
> without worrying about recovery so much.
>

This seems an interesting and productive proposal to me. Thanks very
much, I am fine with it.

> Making some progress finally will be good.
>
>
> Alex, what do you think?
>
>

--
Sincerely,
Cao jin