RE: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
From: Z.q. Hou
Date: Fri May 29 2020 - 00:04:46 EST
Hi Kuppuswamy,
> -----Original Message-----
> From: Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
> Sent: 2020å5æ29æ 5:19
> To: Z.q. Hou <zhiqiang.hou@xxxxxxx>; linux-pci@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; ruscur@xxxxxxxxxx; sbobroff@xxxxxxxxxxxxx;
> oohall@xxxxxxxxx; bhelgaas@xxxxxxxxxx
> Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by
> error_detect()
>
> Hi,
>
> On 5/27/20 1:31 AM, Zhiqiang Hou wrote:
> > From: Hou Zhiqiang <Zhiqiang.Hou@xxxxxxx>
> >
> > The commit 6d2c89441571 ("PCI/ERR: Update error status after
> > reset_link()") overrode the 'status' returned by the error_detect()
> > call back function, which is depended on by the next step. This
> > overriding makes the Endpoint driver's required info (kept in the var
> > status) lost, so it results in the fatal errors' recovery failed and then kernel
> panic.
> Can you explain why updating status affects the recovery ?
Take the e1000e as an example:
Once a fatal error is reported by e1000e, the e1000e's error_detect() will be
called and it will return PCI_ERS_RESULT_NEED_RESET to request a slot reset,
the return value is stored in the '&status' of the calling
pci_walk_bus(bus,report_frozen_detected, &status).
If you update the 'status' with the reset_link()'s return value
(PCI_ERS_RESULT_RECOVERED if the reset link succeed), then the 'status' has
the value PCI_ERS_RESULT_RECOVERED and e1000e's request
PCI_ERS_RESULT_NEED_RESET lost, then e1000e's callback function .slot_reset()
will be skipped and directly call the .resume().
So this is how the update of 'status' break the handshake between RC's AER driver
and the Endpoint's protocol driver error_handlers, then result in the recovery failure.
> >
> > In the e1000e case, the error logs:
> > pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received:
> > 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error:
> > severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent
> > ID) pcieport 0002:00:00.0: AER: Root Port link has been reset
> As per above commit log, it looks like link is reset correctly.
Yes, see my comments above.
Thanks,
Zhiqiang
> > SError Interrupt on CPU0, code 0xbf000002 -- SError
> > CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted
> > 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT)
> > pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--) pc :
> > __pci_enable_msix_range+0x4c8/0x5b8
> > lr : __pci_enable_msix_range+0x480/0x5b8
> > sp : ffff80001116bb30
> > x29: ffff80001116bb30 x28: 0000000000000003
> > x27: 0000000000000003 x26: 0000000000000000
> > x25: ffff00097243e0a8 x24: 0000000000000001
> > x23: ffff00097243e2d8 x22: 0000000000000000
> > x21: 0000000000000003 x20: ffff00095bd46080
> > x19: ffff00097243e000 x18: ffffffffffffffff
> > x17: 0000000000000000 x16: 0000000000000000
> > x15: ffffb958fa0e9948 x14: ffff00095bd46303
> > x13: ffff00095bd46302 x12: 0000000000000038
> > x11: 0000000000000040 x10: ffffb958fa101e68
> > x9 : ffffb958fa101e60 x8 : 0000000000000908
> > x7 : 0000000000000908 x6 : ffff800011600000
> > x5 : ffff00095bd46800 x4 : ffff00096e7f6080
> > x3 : 0000000000000000 x2 : 0000000000000000
> > x1 : 0000000000000000 x0 : 0000000000000000 Kernel panic - not
> > syncing: Asynchronous SError Interrupt
> > CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted
> > 5.7.0-rc7-next-20200526 #8
> >
> > I think it's the expected result that "if the initial value of error
> > status is PCI_ERS_RESULT_DISCONNECT or
> PCI_ERS_RESULT_NO_AER_DRIVER
> > then even after successful recovery (using reset_link())
> > pcie_do_recovery() will report the recovery result as failure" which
> > is described in commit 6d2c89441571 ("PCI/ERR: Update error status after
> reset_link()").
> >
> > Refer to the Documentation/PCI/pci-error-recovery.rst.
> > As the error_detect() is mandatory callback if the pci_err_handlers is
> > implemented, if it return the PCI_ERS_RESULT_DISCONNECT, it means the
> > driver doesn't want to recover at all; For the case
> > PCI_ERS_RESULT_NO_AER_DRIVER, if the pci_err_handlers is not
> > implemented, the failure is more expected.
> >
> > Fixes: commit 6d2c89441571 ("PCI/ERR: Update error status after
> > reset_link()")
> > Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@xxxxxxx>
> > ---
> > drivers/pci/pcie/err.c | 3 +--
> > 1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index
> > 14bb8f54723e..84f72342259c 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev
> *dev,
> > pci_dbg(dev, "broadcast error_detected message\n");
> > if (state == pci_channel_io_frozen) {
> > pci_walk_bus(bus, report_frozen_detected, &status);
> > - status = reset_link(dev);
> > - if (status != PCI_ERS_RESULT_RECOVERED) {
> > + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
> > pci_warn(dev, "link reset failed\n");
> > goto failed;
> > }
> >
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer