Re: [PATCH] vfio/pci: Support error recovery

From: Alex Williamson
Date: Thu Dec 01 2016 - 09:55:49 EST


On Thu, 1 Dec 2016 21:40:00 +0800
Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:

> On 12/01/2016 12:04 PM, Alex Williamson wrote:
> > On Sun, 27 Nov 2016 19:34:17 +0800
> > Cao jin <caoj.fnst@xxxxxxxxxxxxxx> wrote:
> >
> >> It is user space driver's or device-specific driver's(in guest) responsbility
> >> to do a serious recovery when error happened. Link-reset is one part of
> >> recovery, when pci device is assigned to VM via vfio, link-reset will do
> >> twice in host & guest separately, which will cause many trouble for a
> >> successful recovery, so, disable the vfio-pci's link-reset in aer driver
> >> in host, this is a keypoint for guest to do error recovery successfully.
> >>
> >> CC: alex.williamson@xxxxxxxxxx
> >> CC: mst@xxxxxxxxxx
> >> Signed-off-by: Cao jin <caoj.fnst@xxxxxxxxxxxxxx>
> >> ---
> >> This is actually a RFC version(has debug lines left), and has minor changes in
> >> aer driver, so I think maybe it is better not to CC pci guys in this round.
> >> Later will do.
> >>
> >> drivers/pci/pcie/aer/aerdrv_core.c | 12 ++++++-
> >> drivers/vfio/pci/vfio_pci.c | 63 +++++++++++++++++++++++++++++++++++--
> >> drivers/vfio/pci/vfio_pci_private.h | 2 ++
> >> 3 files changed, 74 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> >> index 521e39c..289fb8e 100644
> >> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> >> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> >> @@ -496,7 +496,17 @@ static void do_recovery(struct pci_dev *dev, int severity)
> >> "error_detected",
> >> report_error_detected);
> >>
> >> - if (severity == AER_FATAL) {
> >> + /* vfio-pci as a general meta driver, it actually couldn't do any real
> >> + * recovery for device. It is user space driver, or device-specific
> >> + * driver in guest who should take care of the serious error recovery,
> >> + * link reset actually is one part of whole recovery. Doing reset_link
> >> + * in aer driver of host kernel for vfio-pci devices will cause many
> >> + * trouble for user space driver or guest's device-specific driver,
> >> + * for example: the serious recovery often need to read register in
> >> + * config space, but if register reading happens during link-resetting,
> >> + * it is quite possible to return invalid value like all F's, which
> >> + * will result in unpredictable error. */
> >> + if (severity == AER_FATAL && strcmp(dev->driver->name, "vfio-pci")) {
> >> result = reset_link(dev);
> >> if (result != PCI_ERS_RESULT_RECOVERED)
> >> goto failed;
> >
> > This is not acceptable. If we want to make a path through AER recovery
> > that does not reset the link, there should be a way for the driver to
> > request it. Testing the driver name is a horrible hack. The other
> > question is what guarantees does vfio make that the device does get
> > reset?
>
> I am not sure how vfio guarantee that...When device is assigned to VM,
> we have that guarantees(aer driver in guest driver will do that), so I
> think it is a well-behaved user space driver's responsibility to do link
> reset? And I think if there is a user space driver, it is surely its
> responsibility to consider how to do serious error recovery, like I said
> before, vfio, as a general meta driver, it surely don't know how each
> device does its specific recovery, except bus/slot reset

We can't assume the user it a VM and we certainly cannot assume that
the user is well behaved. By bypassing the link reset here, you've
taken responsibility for assuring that at least that basic level of
recovery is performed. We cannot assume the user will do it though.

> > If an AER fault occurs and the user doesn't do a reset, what
> > happens when that device is released and a host driver tries to make
> > use of it? The user makes no commitment to do a reset and there are
> > only limited configurations where we even allow the user to perform a
> > reset.
> >
>
> Limited? Do you mean the things __pci_dev_reset() can do?

I mean that there are significant device and guest configuration
restrictions in order to support AER. For instance, all the functions
of the slot need to appear in a PCI-e topology in the guest with all
the functions in the right place such that a guest bus reset translates
into a host bus reset. The physical functions cannot be split between
guests even if IOMMU isolation would otherwise allow it. The user
needs to explicitly enable AER support for the devices. A VM need to
be specifically configured for AER support in order to set any sort of
expectations of a guest directed bus reset, let alone a guarantee that
it will happen. So all the existing VMs, where functions are split
between guests, or the topology isn't exactly right, or AER isn't
enabled see a regression from the above change as the device is no
longer reset.

> ...
> >
> >> + aer_cap_offset = pci_find_ext_capability(vdev->pdev, PCI_EXT_CAP_ID_ERR);
> >> + ret = pci_read_config_dword(vdev->pdev, aer_cap_offset +
> >> + PCI_ERR_UNCOR_STATUS, &uncor_status);
> >> + if (ret)
> >> + return PCI_ERS_RESULT_DISCONNECT;
> >> +
> >> + pr_err("device %d got AER detect notification. uncorrectable error status = 0x%x\n", pdev->devfn, uncor_status);//to be removed
> >> mutex_lock(&vdev->igate);
> >> +
> >> + vdev->aer_recovering = true;
> >> + reinit_completion(&vdev->aer_error_completion);
> >> +
> >> + /* suspend config space access from user space,
> >> + * when vfio-pci's error recovery process is on */
> >> + pci_cfg_access_trylock(vdev->pdev);
> >>
> >> - if (vdev->err_trigger)
> >> - eventfd_signal(vdev->err_trigger, 1);
> >> + if (vdev->err_trigger && uncor_status) {
> >> + pr_err("device %d signal uncor status to user space", pdev->devfn);//may be removed
> >> + /* signal uncorrectable error status to user space */
> >> + eventfd_signal(vdev->err_trigger, uncor_status);
> >> + }
> >
> > } else... what? By bypassing the AER link reset we've assumed
> > responsibility for resetting the device. Even if we signal the user,
> > what guarantee do we have that the device is recovered?
> >
>
> else...consider it as a fake error notification and ignore?

Nope. Registering the error eventfd is optional, you've just
introduced a regression where the device was previously reset by the
AER core code and now it's in some unknown state because it generated
an error and we told the core not to reset it.

> I am not sure I understand your thoughts totally, but it seems my
> previous comments apply, that is: it is well-behaved user space driver's
> responsibility to do a serious recovery.

We can *never* assume a well behaved userspace. That's an invalid
starting point.

> In my understanding, user space driver has 2 category: one is VM(has
> guest OS running inside), the other is ordinary user space program.
>
> When device is assigned to a VM, (qemu + guest OS) will do fully steps
> to do recovery(roughly is what struct pci_error_handlers has). So,
> equally, if it is a ordinary user space program acting as the driver,
> the responsibility belongs to it.

This is simply not valid. If vfio is going to assume responsibility
for resetting the device, it can't simply push that responsibility to
userspace. There are only very limited user configurations where a
guest bus reset translating directly to a host bus reset are even
valid. Also keep in mind that we must support old userspace on newer
kernels. There currently does not exist AER recovery on existing
userspace VMs. Thanks,

Alex