Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
From: Fan Ni
Date: Mon Aug 19 2024 - 14:35:41 EST
On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
>
> I added a response below.
>
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> >> Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> >> Cc: linux-pci@xxxxxxxxxxxxxxx
> >> ---
> >> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >> 1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>
> >> + /*
> >> + * PCIe ports may include functionality beyond the standard
> >> + * extended port capabilities. This may present a need to log and
> >> + * handle errors not addressed in this driver. Examples are CXL
> >> + * root ports and CXL downstream switch ports using AER UIE to
> >> + * indicate CXL UCE RAS protocol errors.
> >> + */
> >> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> + type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> + struct pci_driver *pdrv = dev->driver;
> >> +
> >> + if (pdrv && pdrv->err_handler &&
> >> + pdrv->err_handler->error_detected) {
> >> + const struct pci_error_handlers *err_handler;
> >> +
> >> + err_handler = pdrv->err_handler;
> >> + status = err_handler->error_detected(dev, state);
> >> + }
> >> + }
> >> +
> >
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
>
> It does. I can take a look at moving that.
>
With the code as posted, the new block runs whenever the device type
matches (root port or downstream port). I also noticed that the
->subordinate == NULL case in pci_walk_bridge() is never reached when I
inject an error with the aer_inject module and the user space tool.
If my error injection method is right, the behaviour will change after
the code is moved.
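To make the concern concrete, here is a minimal user-space model of the two
placements. It is only a sketch and not kernel code: the struct and function
names (model_dev, model_bus, recover_as_posted, recover_after_move,
root_port_handler_calls) are made up for illustration, the walk only goes one
level deep, and "after the move" assumes the handler call would live solely
in the ->subordinate == NULL branch, which is my reading of the suggestion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures; not the real definitions. */
enum dev_type { DEV_ENDPOINT, DEV_ROOT_PORT, DEV_DOWNSTREAM };

struct model_bus;

struct model_dev {
	enum dev_type type;
	struct model_bus *subordinate;	/* secondary bus, NULL if none */
	int handler_calls;		/* times error_detected() ran here */
};

struct model_bus {
	struct model_dev *devs[4];
	size_t ndevs;
};

/* Stands in for calling the bound driver's ->error_detected(). */
static void error_detected(struct model_dev *dev)
{
	dev->handler_calls++;
}

/*
 * Mirrors the shape of pci_walk_bridge(): visit the devices below the
 * bridge if it has a secondary bus, otherwise visit the bridge itself.
 * The real pci_walk_bus() recurses; one level is enough for this model.
 */
static void walk_bridge(struct model_dev *bridge)
{
	assert(bridge != NULL);
	if (bridge->subordinate) {
		for (size_t i = 0; i < bridge->subordinate->ndevs; i++)
			error_detected(bridge->subordinate->devs[i]);
	} else {
		error_detected(bridge);
	}
}

/* Placement as in the RFC patch: type check before the walk. */
static void recover_as_posted(struct model_dev *dev)
{
	if (dev->type == DEV_ROOT_PORT || dev->type == DEV_DOWNSTREAM)
		error_detected(dev);
	walk_bridge(dev);
}

/* Assumed placement after the move: only the !subordinate branch. */
static void recover_after_move(struct model_dev *dev)
{
	walk_bridge(dev);
}

/*
 * Build "root port -> one type3 memdev" and return how many times the
 * root port's own handler ran for an error reported by the root port.
 */
static int root_port_handler_calls(int moved)
{
	struct model_dev memdev = { .type = DEV_ENDPOINT };
	struct model_bus bus = { .devs = { &memdev }, .ndevs = 1 };
	struct model_dev rp = { .type = DEV_ROOT_PORT, .subordinate = &bus };

	if (moved)
		recover_after_move(&rp);
	else
		recover_as_posted(&rp);
	return rp.handler_calls;
}
```

In this model root_port_handler_calls(0) returns 1 and
root_port_handler_calls(1) returns 0: a root port with a device below it
always has a subordinate bus, so the moved code would not run for it.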
Here is my experimental setup:
QEMU + CXL topology (one type3 memdev directly attached to a HB with a
single root port).
1. Load the CXL related drivers before error injection.
2. Inject the error with aer_inject inside the QEMU VM:
# aer_inject ~/nonfatal
The aer_inject input file looks like this:
-----------------------------------------------------
fan:~/cxl/linux-fixes$ cat ~/nonfatal
# Inject an uncorrectable/non-fatal training error into the device
# with header log words 0 1 2 3.
#
# Either specify the PCI id on the command-line option or uncomment and edit
# the PCI_ID line below using the correct PCI ID.
#
# Note that system firmware/BIOS may mask certain errors, change their severity
# and/or not report header log words.
#
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS COMP_ABORT
HEADER_LOG 0 1 2 3
-----------------------------------------------------
The "lspci" output on the VM looks like this:
----------------------------------------------------
Qemu: execute "lspci" on VM
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
--------------------------------------------------
Fan
> Regards,
> Terry