Re: [PATCH v7 2/5] PCI/DPC: Run recovery on device that detected the error

From: Lukas Wunner

Date: Fri Feb 27 2026 - 05:48:09 EST


On Fri, Feb 27, 2026 at 04:28:59PM +0800, Shuai Xue wrote:
> On 2/7/26 3:48 PM, Shuai Xue wrote:
> > Regarding pci_restore_state() in slot_reset(): I see now that it does
> > call pci_aer_clear_status(dev) (at line 1844 in pci.c), which will
> > clear the AER Status registers. So if we walk the hierarchy after
> > the slot_reset callbacks, the error bits accumulated during DPC will
> > already be cleared.
> >
> > To avoid losing those errors, I think the walk should happen after
> > dpc_reset_link() succeeds but *before* pcie_do_recovery() invokes the
> > slot_reset callbacks. That way, we can capture the AER Status bits
> > before pci_restore_state() clears them.
> >
> > Does that sound like the right approach, or would you prefer a
> > different placement?

The problem is that if the hierarchy that was reset is deeper than
one level, you first need to call pci_restore_state() on all the
PCIe Upstream and Downstream Ports that were reset before you can
access the Endpoints at the bottom of the hierarchy.

E.g. if DPC occurs at a Root Port with multiple nested PCIe switches
below, the Endpoints at the "leaves" of that tree are only accessible
once Config Space has been restored at all the PCIe switches
between the Endpoints and the DPC-capable Root Port.

Hence your proposal unfortunately won't work.
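
To illustrate the ordering constraint, here is a minimal user-space
sketch. It is a toy model, not kernel code: struct toy_dev and the
toy_*() helpers are made up for this example, and "cfg_restored"
merely stands in for what pci_restore_state() does to a bridge.
The all-ones value for an unreachable device is likewise part of
the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy device, NOT struct pci_dev. */
struct toy_dev {
	struct toy_dev *parent;	/* upstream bridge, NULL at the top port */
	bool cfg_restored;	/* has this device's Config Space been restored? */
	uint32_t aer_status;	/* pending error bits (toy value) */
};

/*
 * A device is reachable only if every bridge above it has had its
 * Config Space restored, so that config cycles are forwarded again.
 */
static bool toy_reachable(const struct toy_dev *dev)
{
	assert(dev != NULL);

	for (const struct toy_dev *p = dev->parent; p; p = p->parent)
		if (!p->cfg_restored)
			return false;
	return true;
}

/* In this model, reading an unreachable device yields all ones. */
static uint32_t toy_read_aer_status(const struct toy_dev *dev)
{
	return toy_reachable(dev) ? dev->aer_status : UINT32_MAX;
}
```

With a chain Root Port -> Upstream Port -> Downstream Port -> Endpoint,
toy_read_aer_status() on the Endpoint only returns the real bits once
both switch ports are marked restored. Walking the hierarchy before
the slot_reset callbacks have run would therefore miss the Endpoints.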

I think the solution is to move pci_aer_clear_status() out of
pci_restore_state() into the callers that actually need it.
But that requires going through every single caller.
I began doing that last week and am about 60% done.

Once pci_restore_state() no longer clears the error bits, we can
report and clear them after the "report_slot_reset" stage (which
is where drivers call pci_restore_state()).
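
Roughly, the intended ordering looks like the following user-space
sketch. Again this is only a toy model under my own assumptions:
none of the toy_*() names exist in the kernel, and the real change
touches every caller of pci_restore_state(), which is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy device, NOT struct pci_dev. */
struct toy_dev {
	uint32_t aer_status;	/* error bits accumulated during DPC */
	uint32_t reported;	/* what the recovery code got to see */
};

/* Models today's behavior: restoring state also clears AER status. */
static void toy_restore_state_clearing(struct toy_dev *dev)
{
	assert(dev != NULL);
	dev->aer_status = 0;
}

/* Models the proposed behavior: restoring state leaves the bits alone. */
static void toy_restore_state(struct toy_dev *dev)
{
	assert(dev != NULL);
	/* Config Space restore would happen here; status is untouched. */
}

/* Runs after the slot_reset stage: report the bits, then clear them. */
static void toy_report_and_clear(struct toy_dev *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		devs[i].reported = devs[i].aer_status;
		devs[i].aer_status = 0;
	}
}
```

If toy_restore_state_clearing() runs first, toy_report_and_clear()
has nothing left to report. With toy_restore_state() the bits survive
until they have been reported, which is the point of the change.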

I've also changed my mind and I think reporting and clearing
the error bits *could* happen in pcie_do_recovery() even if it
were used for EEH and s390 because those platforms may plug in
AER-capable devices as well and so we do need to clear the bits
regardless of the error recovery mechanism used.

Let me get back to you once I've gone through all the callers of
pci_restore_state(). Please be patient.

Thank you!

Lukas