[PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
From: Terry Bowman
Date: Tue Oct 08 2024 - 18:17:36 EST
This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling. This patchset adds the CXL PCIe
port handling and logging.
The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.
The following 8 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common RCH and VH, adding port specific error handlers, and protocol error
logging.
[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554
-1-terry.bowman@xxxxxxx/
Testing:
Below are test results for this patchset. This is using Qemu with a root
port (0c:00.0), upstream switch port (0d:00.0),and downstream switch port
(0e:00.0).
This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed).
Root port UCE:
root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
[ 27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[ 27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[ 27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 27.322483] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
[ 27.323243] pcieport 0000:0c:00.0: [22] UncorrIntErr
[ 27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
[ 27.325584]
[ 27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
first_error: 'Memory Address Parity Error'
[ 27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
[ 27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
[ 27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 27.335716] Call Trace:
[ 27.335985] <TASK>
[ 27.336226] panic+0x2ed/0x320
[ 27.336547] ? __pfx_cxl_report_normal_detected+0x10/0x10
[ 27.337037] ? __pfx_aer_root_reset+0x10/0x10
[ 27.337453] cxl_do_recovery+0x304/0x310
[ 27.337833] aer_isr+0x3fd/0x700
[ 27.338154] ? __pfx_irq_thread_fn+0x10/0x10
[ 27.338572] irq_thread_fn+0x1f/0x60
[ 27.338923] irq_thread+0x102/0x1b0
[ 27.339267] ? __pfx_irq_thread_dtor+0x10/0x10
[ 27.339683] ? __pfx_irq_thread+0x10/0x10
[ 27.340059] kthread+0xcd/0x100
[ 27.340387] ? __pfx_kthread+0x10/0x10
[ 27.340748] ret_from_fork+0x2f/0x50
[ 27.341100] ? __pfx_kthread+0x10/0x10
[ 27.341466] ret_from_fork_asm+0x1a/0x30
[ 27.341842] </TASK>
[ 27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
Root port CE:
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
[ 19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
[ 19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
[ 19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
[ 19.447742] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
[ 19.448549] pcieport 0000:0c:00.0: [14] CorrIntErr
[ 19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
[ 19.449223]
[ 19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
Upstream switch port UCE:
root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
[ 45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
[ 45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
[ 45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 45.240412] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
[ 45.241159] pcieport 0000:0d:00.0: [22] UncorrIntErr
[ 45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
[ 45.242448]
[ 45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
first_error: 'Memory Address Parity Error'
[ 45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
[ 45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
[ 45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 45.251907] Call Trace:
[ 45.253284] <TASK>
[ 45.253564] panic+0x2ed/0x320
[ 45.253909] ? __pfx_cxl_report_normal_detected+0x10/0x10
[ 45.255455] ? __pfx_aer_root_reset+0x10/0x10
[ 45.255915] cxl_do_recovery+0x304/0x310
[ 45.257219] aer_isr+0x3fd/0x700
[ 45.257572] ? __pfx_irq_thread_fn+0x10/0x10
[ 45.258006] irq_thread_fn+0x1f/0x60
[ 45.258383] irq_thread+0x102/0x1b0
[ 45.258748] ? __pfx_irq_thread_dtor+0x10/0x10
[ 45.259196] ? __pfx_irq_thread+0x10/0x10
[ 45.259605] kthread+0xcd/0x100
[ 45.259956] ? __pfx_kthread+0x10/0x10
[ 45.260386] ret_from_fork+0x2f/0x50
[ 45.260879] ? __pfx_kthread+0x10/0x10
[ 45.261418] ret_from_fork_asm+0x1a/0x30
[ 45.261936] </TASK>
[ 45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
Upstream switch port CE:
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
[ 37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
[ 37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
[ 37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
[ 37.508759] pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
[ 37.509574] pcieport 0000:0d:00.0: [14] CorrIntErr
[ 37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
[ 37.510180]
[ 37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
Downstream switch port UCE:
root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
[ 29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
[ 29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
[ 29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 29.425670] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
[ 29.426487] pcieport 0000:0e:00.0: [22] UncorrIntErr
[ 29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
[ 29.427111]
[ 29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
first_error: 'Memory Address Parity Error'
[ 29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
[ 29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
[ 29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 29.433031] Call Trace:
[ 29.433354] <TASK>
[ 29.433631] panic+0x2ed/0x320
[ 29.434010] ? __pfx_cxl_report_normal_detected+0x10/0x10
[ 29.434653] ? __pfx_aer_root_reset+0x10/0x10
[ 29.435179] cxl_do_recovery+0x304/0x310
[ 29.435626] aer_isr+0x3fd/0x700
[ 29.436027] ? __pfx_irq_thread_fn+0x10/0x10
[ 29.436507] irq_thread_fn+0x1f/0x60
[ 29.436898] irq_thread+0x102/0x1b0
[ 29.437293] ? __pfx_irq_thread_dtor+0x10/0x10
[ 29.437758] ? __pfx_irq_thread+0x10/0x10
[ 29.438189] kthread+0xcd/0x100
[ 29.438551] ? __pfx_kthread+0x10/0x10
[ 29.438959] ret_from_fork+0x2f/0x50
[ 29.439362] ? __pfx_kthread+0x10/0x10
[ 29.439771] ret_from_fork_asm+0x1a/0x30
[ 29.440221] </TASK>
[ 29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
Downstream switch port CE:
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
[ 177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
[ 177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
[ 177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
[ 177.117985] pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
[ 177.118809] pcieport 0000:0e:00.0: [14] CorrIntErr
[ 177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
[ 177.119521]
[ 177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
Changes RFC->v1:
[Dan] Rename cxl_rch_handle_error() becomes cxl_handle_error()
[Dan] Add cxl_do_recovery()
[Jonathan] Flatten cxl_setup_parent_uport()
[Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
[Jonathan] Rename cxl_dev_is_pci_type()
[Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
replace these find_cxl_port() and device_find_child().
[Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
[Ming] Dont use endpoint as host to cxl_map_component_regs()
[Bjorn] Use "PCIe UIR/CIE" instesad of "AER UI/CIE"
[TODO][Bjorn] Dont use Kconfig to enable/disable a CXL external interface
Terry Bowman (15):
cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
driver
cxl/aer/pci: Update is_internal_error() to be callable w/o
CONFIG_PCIEAER_CXL
cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
PCIe ports
cxl/aer/pci: Add CXL PCIe port correctable error support in AER
service driver
cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
PCIe port devices
cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
service driver
cxl/pci: Change find_cxl_ports() to be non-static
cxl/pci: Map CXL PCIe downstream port RAS registers
cxl/pci: Map CXL PCIe upstream port RAS registers
cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
cxl/pci: Add error handler for CXL PCIe port RAS errors
cxl/pci: Add trace logging for CXL PCIe port RAS errors
cxl/aer/pci: Export pci_aer_unmask_internal_errors()
cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/pci.c | 172 +++++++++++++++++++++++++++++++--------
drivers/cxl/core/port.c | 4 +-
drivers/cxl/core/trace.h | 47 +++++++++++
drivers/cxl/cxl.h | 14 +++-
drivers/cxl/mem.c | 30 ++++++-
drivers/cxl/pci.c | 8 ++
drivers/pci/pci.h | 5 ++
drivers/pci/pcie/aer.c | 123 ++++++++++++++++++++--------
drivers/pci/pcie/err.c | 150 ++++++++++++++++++++++++++++++++++
include/linux/aer.h | 16 ++++
include/linux/pci.h | 3 +
12 files changed, 503 insertions(+), 72 deletions(-)
base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
--
2.34.1