RE: [PATCH] PCI: ERR: Don't override the status returned by error_detect()
From: Z.q. Hou
Date: Wed Jun 10 2020 - 00:07:55 EST
Hi Kuppuswamy,
Thanks a lot for your comments and sorry for my late response!
> -----Original Message-----
> From: Kuppuswamy, Sathyanarayanan
> <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
> Sent: 2020å5æ29æ 12:25
> To: Z.q. Hou <zhiqiang.hou@xxxxxxx>; linux-pci@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; ruscur@xxxxxxxxxx; sbobroff@xxxxxxxxxxxxx;
> oohall@xxxxxxxxx; bhelgaas@xxxxxxxxxx
> Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by
> error_detect()
>
>
>
> On 5/28/20 9:04 PM, Z.q. Hou wrote:
> > Hi Kuppuswamy,
> >
> >> -----Original Message-----
> >> From: Kuppuswamy, Sathyanarayanan
> >> <sathyanarayanan.kuppuswamy@xxxxxxxxxxxxxxx>
> >> Sent: 2020å5æ29æ 5:19
> >> To: Z.q. Hou <zhiqiang.hou@xxxxxxx>; linux-pci@xxxxxxxxxxxxxxx;
> >> linux-kernel@xxxxxxxxxxxxxxx; ruscur@xxxxxxxxxx;
> >> sbobroff@xxxxxxxxxxxxx; oohall@xxxxxxxxx; bhelgaas@xxxxxxxxxx
> >> Subject: Re: [PATCH] PCI: ERR: Don't override the status returned by
> >> error_detect()
> >>
> >> Hi,
> >>
> >> On 5/27/20 1:31 AM, Zhiqiang Hou wrote:
> >>> From: Hou Zhiqiang <Zhiqiang.Hou@xxxxxxx>
> >>>
> >>> The commit 6d2c89441571 ("PCI/ERR: Update error status after
> >>> reset_link()") overrode the 'status' returned by the error_detect()
> >>> call back function, which is depended on by the next step. This
> >>> overriding makes the Endpoint driver's required info (kept in the
> >>> var
> >>> status) lost, so it results in the fatal errors' recovery failed and
> >>> then kernel
> >> panic.
> >> Can you explain why updating status affects the recovery ?
> >
> > Take the e1000e as an example:
> > Once a fatal error is reported by e1000e, the e1000e's error_detect()
> > will be called and it will return PCI_ERS_RESULT_NEED_RESET to request
> > a slot reset, the return value is stored in the '&status' of the
> > calling pci_walk_bus(bus,report_frozen_detected, &status).
> > If you update the 'status' with the reset_link()'s return value
> > (PCI_ERS_RESULT_RECOVERED if the reset link succeed), then the
> > 'status' has the value PCI_ERS_RESULT_RECOVERED and e1000e's request
> > PCI_ERS_RESULT_NEED_RESET lost, then e1000e's callback function
> > .slot_reset() will be skipped and directly call the .resume().
> I believe you are working with non hotplug capable device. If yes, then this
> issue will be addressed by the following patch.
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org
> %2Flkml%2F2020%2F5%2F6%2F1545&data=02%7C01%7Czhiqiang.hou%4
> 0nxp.com%7C4f0ad836e4384f40400f08d80388383c%7C686ea1d3bc2b4c6fa92
> cd99c5c301635%7C0%7C1%7C637263230875781678&sdata=ap0PUMzse
> xIuCnOpBCFPW%2BrUEwMWgAYzT7yxGP8pjio%3D&reserved=0
I'll try this patch, seems it also override the 'status' in the pci_channel_io_frozen
Case but it make sense for me.
Thanks,
Zhiqiang
> >
> > So this is how the update of 'status' break the handshake between RC's
> > AER driver and the Endpoint's protocol driver error_handlers, then result in
> the recovery failure.
> >
> >>>
> >>> In the e1000e case, the error logs:
> >>> pcieport 0002:00:00.0: AER: Uncorrected (Fatal) error received:
> >>> 0002:01:00.0 e1000e 0002:01:00.0: AER: PCIe Bus Error:
> >>> severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent
> >>> ID) pcieport 0002:00:00.0: AER: Root Port link has been reset
> >> As per above commit log, it looks like link is reset correctly.
> >
> > Yes, see my comments above.
> >
> > Thanks,
> > Zhiqiang
> >
> >>> SError Interrupt on CPU0, code 0xbf000002 -- SError
> >>> CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted
> >>> 5.7.0-rc7-next-20200526 #8 Hardware name: LS1046A RDB Board (DT)
> >>> pstate: 80000005 (Nzcv daif -PAN -UAO BTYPE=--) pc :
> >>> __pci_enable_msix_range+0x4c8/0x5b8
> >>> lr : __pci_enable_msix_range+0x480/0x5b8
> >>> sp : ffff80001116bb30
> >>> x29: ffff80001116bb30 x28: 0000000000000003
> >>> x27: 0000000000000003 x26: 0000000000000000
> >>> x25: ffff00097243e0a8 x24: 0000000000000001
> >>> x23: ffff00097243e2d8 x22: 0000000000000000
> >>> x21: 0000000000000003 x20: ffff00095bd46080
> >>> x19: ffff00097243e000 x18: ffffffffffffffff
> >>> x17: 0000000000000000 x16: 0000000000000000
> >>> x15: ffffb958fa0e9948 x14: ffff00095bd46303
> >>> x13: ffff00095bd46302 x12: 0000000000000038
> >>> x11: 0000000000000040 x10: ffffb958fa101e68
> >>> x9 : ffffb958fa101e60 x8 : 0000000000000908
> >>> x7 : 0000000000000908 x6 : ffff800011600000
> >>> x5 : ffff00095bd46800 x4 : ffff00096e7f6080
> >>> x3 : 0000000000000000 x2 : 0000000000000000
> >>> x1 : 0000000000000000 x0 : 0000000000000000 Kernel panic - not
> >>> syncing: Asynchronous SError Interrupt
> >>> CPU: 0 PID: 111 Comm: irq/76-aerdrv Not tainted
> >>> 5.7.0-rc7-next-20200526 #8
> >>>
> >>> I think it's the expected result that "if the initial value of error
> >>> status is PCI_ERS_RESULT_DISCONNECT or
> >> PCI_ERS_RESULT_NO_AER_DRIVER
> >>> then even after successful recovery (using reset_link())
> >>> pcie_do_recovery() will report the recovery result as failure" which
> >>> is described in commit 6d2c89441571 ("PCI/ERR: Update error status
> >>> after
> >> reset_link()").
> >>>
> >>> Refer to the Documentation/PCI/pci-error-recovery.rst.
> >>> As the error_detect() is mandatory callback if the pci_err_handlers
> >>> is implemented, if it return the PCI_ERS_RESULT_DISCONNECT, it means
> >>> the driver doesn't want to recover at all; For the case
> >>> PCI_ERS_RESULT_NO_AER_DRIVER, if the pci_err_handlers is not
> >>> implemented, the failure is more expected.
> >>>
> >>> Fixes: commit 6d2c89441571 ("PCI/ERR: Update error status after
> >>> reset_link()")
> >>> Signed-off-by: Hou Zhiqiang <Zhiqiang.Hou@xxxxxxx>
> >>> ---
> >>> drivers/pci/pcie/err.c | 3 +--
> >>> 1 file changed, 1 insertion(+), 2 deletions(-)
> >>>
> >>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c index
> >>> 14bb8f54723e..84f72342259c 100644
> >>> --- a/drivers/pci/pcie/err.c
> >>> +++ b/drivers/pci/pcie/err.c
> >>> @@ -165,8 +165,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev
> >> *dev,
> >>> pci_dbg(dev, "broadcast error_detected message\n");
> >>> if (state == pci_channel_io_frozen) {
> >>> pci_walk_bus(bus, report_frozen_detected, &status);
> >>> - status = reset_link(dev);
> >>> - if (status != PCI_ERS_RESULT_RECOVERED) {
> >>> + if (reset_link(dev) != PCI_ERS_RESULT_RECOVERED) {
> >>> pci_warn(dev, "link reset failed\n");
> >>> goto failed;
> >>> }
> >>>
> >>
> >> --
> >> Sathyanarayanan Kuppuswamy
> >> Linux Kernel Developer
>
> --
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer