Re: [PATCH v18 3/4] vfio/pci: Add a reset_done callback for vfio-pci driver
From: Farhan Ali
Date: Mon Jun 08 2026 - 15:30:47 EST
On 6/4/2026 12:57 PM, Alex Williamson wrote:
On Thu, 4 Jun 2026 10:17:04 -0700
Farhan Ali <alifm@xxxxxxxxxxxxx> wrote:
On 6/4/2026 1:28 AM, Keith Busch wrote:
On Wed, Jun 03, 2026 at 11:24:14AM -0700, Farhan Ali wrote:
+static void vfio_pci_core_aer_reset_done(struct pci_dev *pdev)
+{
+ struct vfio_pci_core_device *vdev = dev_get_drvdata(&pdev->dev);
+
+ if (!vdev->pci_saved_state)
+ return;
+
+ pci_load_saved_state(pdev, vdev->pci_saved_state);
+ pci_restore_state(pdev);
+}

Shouldn't there be a corresponding user space notification that the
device has been restored? There's an eventfd on the error detected side
so user space can know the device needs recovery, but how does it come
to know that the reset is completed?

I think if the VFIO_DEVICE_RESET ioctl completes successfully it should
be an indication that the reset has completed? AFAIU the ioctl will
drive a reset via pci_try_reset_function(). If reset completes
successfully the reset_done() callback is called via pci_dev_restore().
So I don't think we need an eventfd to notify on reset completion.
Otherwise we would have the same problem today, where userspace is
unaware that VFIO_DEVICE_RESET did indeed successfully reset the device,
no? Or am I missing something?

I'm starting to feel a little sketchy about this. I asked claude to
enumerate the state restores and the source of that restored state.
Hopefully this ascii table survives:
┌──────────────────────────┬────────────────────────┬─────────────────────┐
│ Step │ Source │ Snapshot-dependent? │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ EXP cap save buffer │ │
│ pci_restore_pcie_state │ (pci_find_saved_cap, │ YES │
│ │ cap.data) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ live │ │
│ pci_restore_pasid_state │ pdev->pasid_enabled + │ no │
│ │ pasid_features │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_pri_state │ live pdev->pri_enabled │ no │
│ │ + pri_reqs_alloc │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_ats_state │ live dev->ats_enabled │ no │
│ │ + ats_stu │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_vc_state │ VC ext-cap save buffer │ YES │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ live resource_size() │ │
│ pci_restore_rebar_state │ (re-derived, written │ no │
│ │ to hw) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_dpc_state │ DPC ext-cap save │ YES │
│ │ buffer │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_ptm_state │ PTM ext-cap save │ YES │
│ │ buffer │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ TPH ext-cap save │ │
│ pci_restore_tph_state │ buffer, gated on live │ YES (gated) │
│ │ tph_enabled │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_aer_clear_status │ clears hw status (not │ n/a │
│ │ a restore) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_aer_state │ ERR ext-cap save │ YES │
│ │ buffer │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ │ saved_config_space[16] │ │
│ pci_restore_config_space │ — type-0 header │ YES │
│ │ (COMMAND, BARs, │ │
│ │ cacheline…) │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_pcix_state │ PCI-X cap save buffer │ YES │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_msi_state │ live msi_desc list + │ no │
│ │ msi(x)_enabled │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_enable_acs │ re-derived from ACS │ no │
│ │ policy │ │
├──────────────────────────┼────────────────────────┼─────────────────────┤
│ pci_restore_iov_state │ live dev->sriov │ no │
│ │ (num_VFs, ctrl) │ │
└──────────────────────────┴────────────────────────┴─────────────────────┘
For things like MSI/X, SR-IOV, RE-BAR, etc. we're actually restoring
from the kernel internal state rather than the save buffer state, so
this is a no-op. However, one thing in that list stands out, TPH.
We don't yet support enabling TPH, but there are series on the list
that propose to add this. The TPH buffer space in the saved state is
allocated just by the capability being present. On open TPH is
disabled and the saved state is untouched, zeros. If TPH is then
enabled and the device reset, the pre-reset save state populates the
TPH save buffer and we restore that state post-reset. With the change
here, reset_done would then push the open saved state. The live TPH
state is enabled, therefore the restore pushes the original open state,
zeros.
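
The TPH sequence above can be sketched with a toy userspace model in C. This is not kernel code: a single 32-bit word stands in for the whole TPH save buffer, and every name (toy_dev, toy_reset_done, the 0x101 control value, and so on) is hypothetical and chosen only for illustration. It assumes the restore is gated on the live tph_enabled flag, as the table says.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: one word stands in for the TPH ext-cap save buffer. */
struct tph_save {
	uint32_t ctrl;
};

struct toy_dev {
	uint32_t hw_tph_ctrl;      /* live hardware register */
	bool tph_enabled;          /* live kernel flag that gates the restore */
	struct tph_save saved;     /* most recent save-state snapshot */
	struct tph_save open_snap; /* snapshot taken at open, TPH off: zeros */
};

/* Models the pre-reset save: capture the live register. */
static void toy_save(struct toy_dev *d)
{
	d->saved.ctrl = d->hw_tph_ctrl;
}

/* Models the gated TPH restore: only writes hardware if TPH is live. */
static void toy_restore(struct toy_dev *d)
{
	if (d->tph_enabled)
		d->hw_tph_ctrl = d->saved.ctrl;
}

/* Models a function reset: save, hardware clears, restore. */
static void toy_reset(struct toy_dev *d)
{
	toy_save(d);
	d->hw_tph_ctrl = 0;
	toy_restore(d);
}

/* Models the proposed reset_done: load the open-time snapshot, restore. */
static void toy_reset_done(struct toy_dev *d)
{
	d->saved = d->open_snap;
	toy_restore(d);
}

/* Runs the scenario and returns the TPH register the user ends up with. */
uint32_t toy_tph_after_reset(bool with_reset_done)
{
	struct toy_dev d = { 0 };      /* open: TPH disabled, snapshot zeros */

	d.hw_tph_ctrl = 0x101;         /* user enables TPH (arbitrary value) */
	d.tph_enabled = true;

	toy_reset(&d);
	if (with_reset_done)
		toy_reset_done(&d);
	return d.hw_tph_ctrl;
}
```

In this model, toy_tph_after_reset(false) leaves the user's TPH value in place, while toy_tph_after_reset(true) ends with the register zeroed even though the live flag still says TPH is enabled.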
This would result in a user-visible change and, maybe more importantly,
shows that we're relying on ad-hoc behavior, without really any specific
policy to have this work reliably. It actually seems like only in the
close function, where we've disabled anything the user might have
enabled, is it really valid to restore the original state. Thanks,
Alex
I was trying to see if we can remove the callback and still restore the
device. The original reason we wanted the callback was to restore the
device state into something sane [1]. Since commit a2f1e22390ac
("PCI/ERR: Ensure error recoverability at all times"), which removed the
saved_state check from pci_restore_state(), we will always restore from
the last saved state. However, the last saved state in vfio can have the
Command register Memory bit disabled (for example, this could be done
through vfio_pci_pre_reset() in QEMU). This becomes problematic when
MSI-X is enabled and we try to restore it from in-kernel data. AFAICT
the MSI-X restore path doesn't check whether the Memory bit is enabled
before writing the MSI-X message, which could cause an Unsupported
Request error from the device. In my experiments, enabling the Memory
bit before restoring MSI-X state was sufficient to get the device into a
sane state without this patch.

So we could remove this patch. But maybe there is a gap in the MSI-X
restoration path?
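
To make the suspected gap concrete, here is a toy userspace model in C of writing MSI-X messages with and without the Memory bit set. It is only a sketch under stated assumptions, not the kernel's MSI-X restore path: the struct, the function names, and the four-vector table are hypothetical, and a failed write simply stands in for an Unsupported Request. The only value taken from real headers is 0x2 for the Command register Memory Space bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same value as the Command register Memory Space Enable bit. */
#define TOY_PCI_COMMAND_MEMORY 0x2
#define TOY_NVEC 4

/* Hypothetical device model; the MSI-X table lives in a memory BAR. */
struct toy_msix_dev {
	uint16_t command;                /* Command register */
	uint64_t hw_table[TOY_NVEC];     /* MSI-X table as the device sees it */
	uint64_t kernel_msgs[TOY_NVEC];  /* in-kernel copy of the messages */
};

/*
 * A memory write that the device only claims while memory decode is on.
 * Returning false stands in for an Unsupported Request.
 */
static bool toy_mmio_write(struct toy_msix_dev *d, int i, uint64_t v)
{
	if (!(d->command & TOY_PCI_COMMAND_MEMORY))
		return false;
	d->hw_table[i] = v;
	return true;
}

/*
 * Restore the MSI-X table from the in-kernel copy. With ensure_memory
 * set, the Memory bit is enabled first. Returns the number of writes
 * the device did not claim.
 */
int toy_restore_msix(struct toy_msix_dev *d, bool ensure_memory)
{
	int failed = 0;

	if (ensure_memory)
		d->command |= TOY_PCI_COMMAND_MEMORY;

	for (int i = 0; i < TOY_NVEC; i++)
		if (!toy_mmio_write(d, i, d->kernel_msgs[i]))
			failed++;

	return failed;
}
```

With the Memory bit clear and ensure_memory false, every write in this model fails; with ensure_memory true, all of them land. Whether a real fix should put the bit back to its saved value afterwards is left open here.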
[1] https://lore.kernel.org/all/20250814145743.204ca19a.alex.williamson@xxxxxxxxxx/
Thanks
Farhan