Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin
Date: Wed Dec 14 2016 - 05:22:38 EST

Sorry for the late reply.
After reading all your comments, I think I will try solution 1.

On 12/13/2016 03:12 AM, Alex Williamson wrote:
> On Mon, 12 Dec 2016 21:49:01 +0800
> Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
>> Hi,
>> I have 2 solutions (high-level design) in mind; please see if they are
>> acceptable, or which one is acceptable. I also have some questions.
>> 1. block guest access during host recovery
>> add a new field "error_recovering" in struct vfio_pci_device to
>> indicate host recovery status. The AER driver in the host will
>> still do the link reset
>> - set error_recovering in vfio-pci driver's error_detected, used to
>> block all kinds of user access (config space, mmio)
>> - in order to solve the concurrency issue between device reset & user
>> access, check the device state[*] in vfio-pci driver's resume to see
>> if the device reset is done; if it is, then clear "error_recovering",
>> or else start a timer and check the device state periodically until
>> the device reset is done. (what if the device reset doesn't end for a
>> long time?)
>> - In qemu, translate guest link reset to host link reset.
>> A question here: we already have a link reset in the host, is a
>> second link reset necessary? why?
>> [*] how to check the device state: read a certain config space
>> register and check whether the return value is valid (all F's
>> means it is not)
> Isn't this exactly the path we were on previously?

Yes, it is basically the previous path, plus the optimization.
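To make the blocking idea in solution 1 concrete, here is a minimal user-space model of the flag handling. All names here are illustrative assumptions, not the real struct vfio_pci_device layout or the actual vfio-pci callbacks:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of solution 1's recovery flag. */
struct vfio_pci_device_model {
    bool error_recovering;  /* set while host AER recovery is in flight */
};

/* error_detected callback: block further user access */
static void model_error_detected(struct vfio_pci_device_model *vdev)
{
    vdev->error_recovering = true;
}

/* config/mmio access path: refuse access during recovery */
static int model_user_access(struct vfio_pci_device_model *vdev)
{
    if (vdev->error_recovering)
        return -EAGAIN;  /* or block/retry until recovery completes */
    return 0;            /* proceed with the real access */
}

/* resume callback: re-allow access once the device is back */
static void model_resume(struct vfio_pci_device_model *vdev)
{
    vdev->error_recovering = false;
}
```

In the real driver the flag would of course need locking against concurrent accessors; this sketch only shows the state transitions.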

> There might be an
> optimization that we could skip back-to-back resets, but how can you
> necessarily infer that the resets are for the same thing? If the user
> accesses the device between resets, can you still guarantee the guest
> directed reset is unnecessary? If time passes between resets, do you
> know they're for the same event? How much time can pass between the
> host and guest reset to know they're for the same event? In the
> process of error handling, which is more important, speed or
> correctness?

I think the vfio driver itself won't know what each reset is for, and I
don't quite understand why vfio should care about this question. Is this
a new question in the design?

But I think it makes sense that user access between the 2 resets may be
trouble for guest recovery; a misbehaving user could do things beyond our
imagination. Correctness is more important.

If I understand you right, let me make a summary: host recovery just
does a link reset, which is incomplete, so we'd better do a complete
guest-directed recovery for correctness.
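For the device-state check marked [*] above, the idea is that a config read which no device completes returns all 1s on the bus. A small sketch of that validity test follows; in the driver the value would come from something like pci_read_config_dword() on the vendor/device ID register, while here it is just a pure function over the returned dword:

```c
#include <stdbool.h>
#include <stdint.h>

/* A config read on a device that is still resetting (or absent)
 * returns all 1s, since nothing completes the request.  A valid
 * vendor/device ID dword therefore means the device is back. */
static bool device_back_from_reset(uint32_t id_dword)
{
    return id_dword != 0xffffffffu;
}
```

The resume path would call this from the periodic timer until it returns true (with some upper bound on how long to keep polling).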

>> 2. skip the link reset in the host kernel's AER driver for vfio-pci,
>> and let the user decide how to do a full recovery
>> add a new field "user_driver" in struct pci_dev, used to skip the
>> link reset for vfio-pci; add a new field "link_reset" in struct
>> vfio_pci_device to indicate whether the link has been reset during
>> recovery
>> - set user_driver in vfio_pci_probe(), to skip the link reset for
>> vfio-pci in the host.
>> - (use a flag to) block user access (config, mmio) during host
>> recovery (not sure if this step is necessary)
>> - In qemu, translate guest link reset to host link reset.
>> - In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
>> is executed
>> - In vfio-pci driver's resume, start a timer and check the
>> "link_reset" field periodically; if it is set within a reasonable
>> time, clear it and delete the timer, or else the vfio-pci driver
>> will do the link reset itself!
> What happens in the case of a multifunction device where each function
> is part of a separate IOMMU group and one function is hot-removed from
> the user? We can't do a link reset on that function since the other
> function is still in use. We have no choice but to release a device in
> an unknown state back to the host.

Hot-remove from the user: do you mean, for example, all functions are
assigned to the VM, and then suddenly a person does something like the
following

$ echo 0000:06:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind

$ echo 0000:06:00.0 > /sys/bus/pci/drivers/igb/bind

to return the device to the host driver, or doesn't bind it to a host
driver at all, leaving it in a driver-less state?

> As previously discussed, we don't
> expect that any sort of function-level FLR will necessarily reset the
> device to the same state. I also don't really like vfio-pci taking
> over error handling capabilities from the PCI-core. That's redundant
> code and extra maintenance overhead.

I understand the concern, so I suppose solution 1 is preferred.
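For the "translate guest link reset to host link reset" step in solution 1, the QEMU-side trigger condition could be modeled as below. The helper function is hypothetical, but the bit value matches the Secondary Bus Reset bit of the PCI Bridge Control register (PCI_BRIDGE_CTL_BUS_RESET in the Linux headers); the actual host reset would then go through something like the VFIO_DEVICE_PCI_HOT_RESET ioctl:

```c
#include <stdbool.h>
#include <stdint.h>

/* Secondary Bus Reset bit in the Bridge Control register. */
#define BRIDGE_CTL_BUS_RESET 0x40

/* Hypothetical model: when the guest writes the virtual bridge's
 * Bridge Control register, QEMU would issue a host reset only on
 * the 0 -> 1 transition of the Secondary Bus Reset bit. */
static bool guest_write_triggers_host_reset(uint16_t old_ctl,
                                            uint16_t new_ctl)
{
    return !(old_ctl & BRIDGE_CTL_BUS_RESET) &&
            (new_ctl & BRIDGE_CTL_BUS_RESET);
}
```

This is only the trigger-detection logic; deciding whether the host link reset already performed by AER makes this one redundant is exactly the open question above.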

Cao jin

>> A quick question:
>> I don't know how devices are divided into IOMMU groups; is it possible
>> for functions in a multi-function device to be split into different
>> groups?
> Yes, if a multifunction device supports ACS or if we have quirks to
> expose that the functions do not perform internal peer-to-peer, then
> they may be in separate IOMMU groups, depending on the rest of the PCI
> topology. See:
> Thanks,
> Alex