Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler

From: Dan Williams

Date: Wed Apr 01 2026 - 23:40:45 EST

Bowman, Terry wrote:
[..]
> > What am I missing?
> >
>
> The USP case needs a PCIe UCE handler added.
>
> CE are cleared by the AER driver. UCE are not cleared by the AER driver and is left to
> the device drivers' handlers to clear.

At least for me this discussion is difficult without a test case. Can we
start by deleting this handler now that CXL errors are handled
elsewhere? Then identify an injection test that shows the missed
handling. Then the patch story becomes clear, something like:

---
cxl/pci: Handle PCI uncorrectable errors

The previous error handlers in the cxl_pci driver were removed after
CXL port protocol error handling was moved to the core. However, that
leaves uncorrectable error cases unhandled.
That is unwanted because the default handling causes
$end_user_impact_reason. Here is the output of the
ndctl/test/contrib/$aer-inject script for before and after highlighting
the problem.
---

> > Why does the cxl_pci driver not also assume that the links are down?
> >
>
> I took a best effort during the fatal UCE. It is calling panic after this.

If CXL.cachemem is down, does it need to panic? Seems like in that case
the only concern is for mailbox and MMIO operations. The default
behavior of secondary bus recovery seems sufficient.

	if (status == PCI_ERS_RESULT_NEED_RESET ||
	    state == pci_channel_io_frozen) {
		if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
			pci_warn(bridge, "subordinate device reset failed\n");
			goto failed;
		}
	}

If that is not sufficient then the changelog should explain why.
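
To make the default behavior concrete, here is a rough user-space model
of the branch quoted above. It is only a sketch for reasoning about the
cases, not kernel code: the enums are stand-ins redefined locally, and
recovery_may_proceed(), reset_ok() and reset_fails() are hypothetical
names that do not exist in the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local stand-ins for the kernel's pci_ers_result_t values (assumption:
 * only the values needed for this sketch are modelled). */
enum ers_result {
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
};

/* Local stand-ins for the kernel's pci_channel_state_t values. */
enum channel_state {
	CHANNEL_IO_NORMAL,
	CHANNEL_IO_FROZEN,
	CHANNEL_IO_PERM_FAILURE,
};

/* Hypothetical callback type modelling reset_subordinates(bridge). */
typedef enum ers_result (*reset_fn)(void *bridge);

/*
 * Returns true when recovery can continue, false when the quoted code
 * would take the "goto failed" path. The reset is attempted only when a
 * reset was requested or the channel is frozen.
 */
bool recovery_may_proceed(enum ers_result status, enum channel_state state,
			  reset_fn reset_subordinates, void *bridge)
{
	if (status == ERS_NEED_RESET || state == CHANNEL_IO_FROZEN) {
		if (reset_subordinates(bridge) != ERS_RECOVERED)
			return false;
	}
	return true;
}

/* Hypothetical reset callbacks for exercising both outcomes. */
enum ers_result reset_ok(void *bridge)
{
	(void)bridge;
	return ERS_RECOVERED;
}

enum ers_result reset_fails(void *bridge)
{
	(void)bridge;
	return ERS_DISCONNECT;
}
```

In this model a frozen channel with a successful secondary bus reset
continues recovery without any driver-specific handler, which is the
case the changelog would need to argue is insufficient.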