Re: [PATCH] vfio/pci: Support error recovery

From: Cao jin
Date: Mon Dec 12 2016 - 08:44:59 EST


Hi,
I have 2 solutions(high level design) came to me, please see if they are
acceptable, or which one is acceptable. Also have some questions.

1. block guest access during host recovery

add new field error_recovering in struct vfio_pci_device to
indicate host recovery status. aer driver in host will still do
reset link

- set error_recovering in vfio-pci driver's error_detected, used to
block all kinds of user access(config space, mmio)
- in order to solve concurrent issue of device resetting & user
access, check device state[*] in vfio-pci driver's resume, see if
device reset is done, if it is, then clear"error_recovering", or
else new a timer, check device state periodically until device
reset is done. (what if device reset don't end for a long time?)
- In qemu, translate guest link reset to host link reset.
A question here: we already have link reset in host, is a second
link reset necessary? why?

[*] how to check device state: reading certain config space
register, check return value is valid or not(All F's)

2. skip link reset in aer driver of host kernel, for vfio-pci.
Let user decide how to do serious recovery

add new field "user_driver" in struct pci_dev, used to skip link
reset for vfio-pci; add new field "link_reset" in struct
vfio_pci_device to indicate link has been reset or not during
recovery

- set user_driver in vfio_pci_probe(), to skip link reset for
vfio-pci in host.
- (use a flag)block user access(config, mmio) during host recovery
(not sure if this step is necessary)
- In qemu, translate guest link reset to host link reset.
- In vfio-pci driver, set link_reset after VFIO_DEVICE_PCI_HOT_RESET
is executed
- In vfio-pci driver's resume, new a timer, check "link_reset" field
periodically, if it is set in reasonable time, then clear it and
delete timer, or else, vfio-pci driver will does the link reset!


A quick question:
I don't know how devices is divided into iommu groups, is it possible
for functions in a multi-function device to be split into different groups?

--
Sincerely,
Cao jin


On 11/27/2016 07:34 PM, Cao jin wrote:
> It is user space driver's or device-specific driver's(in guest) responsbility
> to do a serious recovery when error happened. Link-reset is one part of
> recovery, when pci device is assigned to VM via vfio, link-reset will do
> twice in host & guest separately, which will cause many trouble for a
> successful recovery, so, disable the vfio-pci's link-reset in aer driver
> in host, this is a keypoint for guest to do error recovery successfully.
>
> CC: alex.williamson@xxxxxxxxxx
> CC: mst@xxxxxxxxxx
> Signed-off-by: Cao jin <caoj.fnst@xxxxxxxxxxxxxx>
> ---
> This is actually a RFC version(has debug lines left), and has minor changes in
> aer driver, so I think maybe it is better not to CC pci guys in this round.
> Later will do.
>
> drivers/pci/pcie/aer/aerdrv_core.c | 12 ++++++-
> drivers/vfio/pci/vfio_pci.c | 63 +++++++++++++++++++++++++++++++++++--
> drivers/vfio/pci/vfio_pci_private.h | 2 ++
> 3 files changed, 74 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index 521e39c..289fb8e 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -496,7 +496,17 @@ static void do_recovery(struct pci_dev *dev, int severity)
> "error_detected",
> report_error_detected);
>
> - if (severity == AER_FATAL) {
> + /* vfio-pci as a general meta driver, it actually couldn't do any real
> + * recovery for device. It is user space driver, or device-specific
> + * driver in guest who should take care of the serious error recovery,
> + * link reset actually is one part of whole recovery. Doing reset_link
> + * in aer driver of host kernel for vfio-pci devices will cause many
> + * trouble for user space driver or guest's device-specific driver,
> + * for example: the serious recovery often need to read register in
> + * config space, but if register reading happens during link-resetting,
> + * it is quite possible to return invalid value like all F's, which
> + * will result in unpredictable error. */
> + if (severity == AER_FATAL && strcmp(dev->driver->name, "vfio-pci")) {
> result = reset_link(dev);
> if (result != PCI_ERS_RESULT_RECOVERED)
> goto failed;
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 712a849..aefd751 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -535,6 +535,15 @@ static long vfio_pci_ioctl(void *device_data,
> struct vfio_pci_device *vdev = device_data;
> unsigned long minsz;
>
> + if (vdev->aer_recovering && (cmd == VFIO_DEVICE_SET_IRQS ||
> + cmd == VFIO_DEVICE_RESET || cmd == VFIO_DEVICE_PCI_HOT_RESET)) {
> + int ret;
> + ret = wait_for_completion_interruptible(
> + &vdev->aer_error_completion);
> + if (ret)
> + return ret;
> + }
> +
> if (cmd == VFIO_DEVICE_GET_INFO) {
> struct vfio_device_info info;
>
> @@ -1117,6 +1126,7 @@ static int vfio_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> vdev->irq_type = VFIO_PCI_NUM_IRQS;
> mutex_init(&vdev->igate);
> spin_lock_init(&vdev->irqlock);
> + init_completion(&vdev->aer_error_completion);
>
> ret = vfio_add_group_dev(&pdev->dev, &vfio_pci_ops, vdev);
> if (ret) {
> @@ -1176,6 +1186,9 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
> {
> struct vfio_pci_device *vdev;
> struct vfio_device *device;
> + u32 uncor_status = 0;
> + unsigned int aer_cap_offset = 0;
> + int ret;
>
> device = vfio_device_get_from_dev(&pdev->dev);
> if (device == NULL)
> @@ -1187,10 +1200,30 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
> return PCI_ERS_RESULT_DISCONNECT;
> }
>
> + /* get device's uncorrectable error status as soon as possible,
> + * and signal it to user space. The later we read it, the possibility
> + * the register value is mangled grows. */
> + aer_cap_offset = pci_find_ext_capability(vdev->pdev, PCI_EXT_CAP_ID_ERR);
> + ret = pci_read_config_dword(vdev->pdev, aer_cap_offset +
> + PCI_ERR_UNCOR_STATUS, &uncor_status);
> + if (ret)
> + return PCI_ERS_RESULT_DISCONNECT;
> +
> + pr_err("device %d got AER detect notification. uncorrectable error status = 0x%x\n", pdev->devfn, uncor_status);//to be removed
> mutex_lock(&vdev->igate);
> +
> + vdev->aer_recovering = true;
> + reinit_completion(&vdev->aer_error_completion);
> +
> + /* suspend config space access from user space,
> + * when vfio-pci's error recovery process is on */
> + pci_cfg_access_trylock(vdev->pdev);
>
> - if (vdev->err_trigger)
> - eventfd_signal(vdev->err_trigger, 1);
> + if (vdev->err_trigger && uncor_status) {
> + pr_err("device %d signal uncor status to user space", pdev->devfn);//may be removed
> + /* signal uncorrectable error status to user space */
> + eventfd_signal(vdev->err_trigger, uncor_status);
> + }
>
> mutex_unlock(&vdev->igate);
>
> @@ -1199,8 +1232,34 @@ static pci_ers_result_t vfio_pci_aer_err_detected(struct pci_dev *pdev,
> return PCI_ERS_RESULT_CAN_RECOVER;
> }
>
> +static void vfio_pci_aer_resume(struct pci_dev *pdev)
> +{
> + struct vfio_pci_device *vdev;
> + struct vfio_device *device;
> +
> + device = vfio_device_get_from_dev(&pdev->dev);
> + if (device == NULL)
> + return;
> +
> + vdev = vfio_device_data(device);
> + if (vdev == NULL) {
> + vfio_device_put(device);
> + return;
> + }
> +
> + /* vfio-pci's error recovery is done, time to
> + * resume pci config space's accesses */
> + pci_cfg_access_unlock(vdev->pdev);
> +
> + vdev->aer_recovering = false;
> + complete_all(&vdev->aer_error_completion);
> +
> + vfio_device_put(device);
> +}
> +
> static const struct pci_error_handlers vfio_err_handlers = {
> .error_detected = vfio_pci_aer_err_detected,
> + .resume = vfio_pci_aer_resume,
> };
>
> static struct pci_driver vfio_pci_driver = {
> diff --git a/drivers/vfio/pci/vfio_pci_private.h b/drivers/vfio/pci/vfio_pci_private.h
> index 8a7d546..ebf1041 100644
> --- a/drivers/vfio/pci/vfio_pci_private.h
> +++ b/drivers/vfio/pci/vfio_pci_private.h
> @@ -83,6 +83,8 @@ struct vfio_pci_device {
> bool bardirty;
> bool has_vga;
> bool needs_reset;
> + bool aer_recovering;
> + struct completion aer_error_completion;
> struct pci_saved_state *pci_saved_state;
> int refcnt;
> struct eventfd_ctx *err_trigger;
>