RE: [PATCH v4 0/3] PCI/AER: Handle Advisory Non-Fatal error

From: Duan, Zhenzhong
Date: Wed May 29 2024 - 01:33:09 EST


Hi,

Kindly ping.
I would appreciate any comments and suggestions so that I can go ahead.

Thanks
Zhenzhong

>-----Original Message-----
>From: Duan, Zhenzhong <zhenzhong.duan@xxxxxxxxx>
>Subject: [PATCH v4 0/3] PCI/AER: Handle Advisory Non-Fatal error
>
>Hi,
>
>This continues Qingshun's v2 [1], but is changed to focus on ANFE
>processing, as the subject suggests, and drops the trace event for now. I
>think it is a bit heavy to do extra I/O to read PCIe registers only for
>tracing, and I do not see a community request for it at the moment.
>
>According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
>6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
>ERR_NONFATAL, be logged as Advisory Non-Fatal Error (ANFE), and set bits in
>both Correctable Error (CE) Status register and Uncorrectable Error (UE)
>Status register. Currently, when handling AER events the kernel will only
>look at CE status or UE status, but never both. In the ANFE case, bits set
>in the UE status register will not be reported and cleared until the next
>FE/NFE arrives.
>
>For instance, previously, when the kernel receives an ANFE with Poisoned
>TLP in OS native AER mode, only the status of CE will be reported and
>cleared:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
>
>If the kernel receives a Malformed TLP after that, two UEs will be
>reported, which is unexpected. The Malformed TLP Header is lost since
>the previous ANFE gated the TLP header logs:
>
> PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00041000/00180020
> [12] TLP (First)
> [18] MalfTLP
>
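
To read the status words in these logs, it helps to list which bit positions are set. A minimal sketch follows; aer_status_bits() is a hypothetical helper written for illustration and is not taken from drivers/pci/pcie/aer.c.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the kernel: store the positions of
 * the bits set in a 32-bit AER status word, lowest bit first.
 * Returns how many positions were written to out[] (at most 32).
 */
static size_t aer_status_bits(uint32_t status, unsigned int out[32])
{
	size_t n = 0;

	for (unsigned int bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			out[n++] = bit;
	}
	return n;
}
```

Applied to the UE status 00041000 above, this yields bits 12 and 18, matching the two entries in the log; applied to the CE status 00002000, it yields bit 13.
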
>To handle this case properly, calculate potential ANFE related status bits
>and save in aer_err_info. Use this information to determine the status bits
>that need to be cleared.
>
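
As a rough illustration of "calculate potential ANFE related status bits", here is a minimal userspace sketch. It is not the code from this series: anfe_candidate_bits() is a hypothetical name, and the filter (UE bits that are set, unmasked, and configured as non-fatal) is an assumption about one plausible way to pick the candidates. The three constants use the values from include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

/* Same values as in the kernel's include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_ADV_NFAT	0x00002000u	/* Advisory Non-Fatal */
#define PCI_ERR_UNC_POISON_TLP	0x00001000u	/* Poisoned TLP */
#define PCI_ERR_UNC_MALF_TLP	0x00040000u	/* Malformed TLP */

/*
 * Hypothetical helper (assumption, not the patch's implementation):
 * given snapshots of the CE status and the UE status/mask/severity
 * registers, return the UE status bits that could have been signaled
 * as an Advisory Non-Fatal Error.
 */
static uint32_t anfe_candidate_bits(uint32_t cor_status,
				    uint32_t uncor_status,
				    uint32_t uncor_mask,
				    uint32_t uncor_severity)
{
	/* Without the Advisory Non-Fatal bit in CE status there is no ANFE. */
	if (!(cor_status & PCI_ERR_COR_ADV_NFAT))
		return 0;

	/* Keep UE bits that are set, not masked, and not fatal severity. */
	return uncor_status & ~uncor_mask & ~uncor_severity;
}
```

With the values from the logs above (CE status 00002000, UE status 00041000, UE mask 00180020) and an assumed severity register that marks only Malformed TLP as fatal, the helper returns 00001000, the Poisoned TLP bit.
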
>Now, for the previous scenario, both CE status and related UE status will
>be reported and cleared after ANFE:
>
> AER: Correctable error message received from 0000:b7:02.0
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> device [8086:0db0] error status/mask=00002000/00000000
> [13] NonFatalErr
> Uncorrectable errors that may cause Advisory Non-Fatal:
> [12] TLP
>
>Note:
>checkpatch.pl will produce the following warnings on PATCH2/3:
>
>WARNING: 'UE' may be misspelled - perhaps 'USE'?
>#22:
>uncorrectable error(UE) status should be cleared. However, there is no
>
>...similar warnings omitted...
>
>These are false positives, so they are not fixed.
>
>WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit
>description?)
>#10:
> PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>
>...similar warnings omitted...
>
>For readability reasons, these warnings are not fixed.
>
>
>
>[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@xxxxxxxxxxxxxxx
>
>Thanks
>Qingshun, Zhenzhong
>
>Changelog:
>v4:
> - Fix a race in anfe_get_uc_status() (Jonathan)
> - Add a comment to explain the side effect of processing ANFE as NFE (Jonathan)
> - Drop the check for PCI_EXP_DEVSTA_NFED
>
>v3:
> - Split ANFE print and processing into two patches (Bjorn)
> - Simplify ANFE handling, drop trace event
> - Polish comments and patch description
> - Add Tested-by
>
>v2:
> - Reference the latest PCIe Specification in both commit messages
> and comments, as suggested by Bjorn Helgaas.
> - Describe the reason for storing additional information in
> aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
> Helgaas.
> - Add more details of behavior changes in the commit message of PATCH
> 2, as suggested by Bjorn Helgaas.
>
>v3: https://lore.kernel.org/lkml/20240417061407.1491361-1-zhenzhong.duan@xxxxxxxxx
>v2: https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@xxxxxxxxxxxxxxx
>v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@xxxxxxxxxxxxxxx
>
>Zhenzhong Duan (3):
> PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
> PCI/AER: Print UNCOR_STATUS bits that might be ANFE
> PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
>
> drivers/pci/pci.h | 1 +
> drivers/pci/pcie/aer.c | 75 +++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 75 insertions(+), 1 deletion(-)
>
>--
>2.34.1