Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler

From: Bowman, Terry

Date: Tue Mar 31 2026 - 15:52:59 EST

On 3/31/2026 2:23 PM, Dan Williams wrote:
> Bowman, Terry wrote:
>> On 3/29/2026 8:22 PM, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>>>> however, continue to be reported via the AER capability and must still be
>>>> handled by a PCI error recovery callback.
>>>>
>>>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>>>> new cxl_pci_error_detected() implementation that handles uncorrectable
>>>> AER PCI protocol errors. Changes for PCI Correctable protocol errors will
>>>> be added in a future patch.
>>>>
>>>> Introduce function cxl_uncor_aer_present() to handle and log the CXL
>>>> Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
>>>> the AER driver and require logging here with a call to pci_print_aer().
>>>>
>>>> This cleanly separates CXL protocol error handling from PCI AER handling
>>>> and ensures that each subsystem processes only the errors it is
>>>> responsible.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>>>> Assisted-by: Azure:gpt4.1-nano-key
>>>
>>> This is not a human. If you want a tool to do some of the labor of the
>>> patch I as a reviewer want to know what labor is performed. This tag in
>>> this case tells me nothing actionable about what needs reviewing.
>>>
>>> See: Documentation/process/generated-content.rst
>>>
>>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>>> index 254144d19764..884e40c66638 100644
>>>> --- a/drivers/cxl/core/ras.c
>>>> +++ b/drivers/cxl/core/ras.c
>>>> @@ -393,34 +393,41 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>>>> }
>>>> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>>>
>>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>>> - pci_channel_state_t state)
>>>> +static bool cxl_uncor_aer_present(struct pci_dev *pdev)
>>>> {
>>>> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>>> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
>>>> - struct device *dev = &cxlmd->dev;
>>>> - bool ue;
>>>> -
>>>> - scoped_guard(device, dev) {
>>>> - if (!dev->driver) {
>>>> - dev_warn(&pdev->dev,
>>>> - "%s: memdev disabled, abort error handling\n",
>>>> - dev_name(dev));
>>>> - return PCI_ERS_RESULT_DISCONNECT;
>>>> - }
>>>> + struct aer_capability_regs aer_regs;
>>>> + u32 fatal, aer_cap = pdev->aer_cap;
>>>>
>>>> - if (cxlds->rcd)
>>>> - cxl_handle_rdport_errors(pdev);
>>>> - /*
>>>> - * A frozen channel indicates an impending reset which is fatal to
>>>> - * CXL.mem operation, and will likely crash the system. On the off
>>>> - * chance the situation is recoverable dump the status of the RAS
>>>> - * capability registers and bounce the active state of the memdev.
>>>> - */
>>>> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
>>>> - cxlmd->endpoint->regs.ras);
>>>> + if (!aer_cap) {
>>>> + pr_warn_ratelimited("%s: AER capability isn't present\n",
>>>> + pci_name(pdev));
>>>> + return false;
>>>
>>> How did the PCI core generate the error if this capability is not
>>> present?
>>>
>>
>> This was taken from the AER driver likely for GHES support. The aer_cap check is not
>> needed here and an be removed.
>>
>>>> }
>>>>
>>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS,
>>>> + &aer_regs.uncor_status);
>>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK,
>>>> + &aer_regs.uncor_mask);
>>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_SEVER,
>>>> + &aer_regs.uncor_severity);
>>>> +
>>>> + fatal = (aer_regs.uncor_severity & aer_regs.uncor_severity);
>>>> + pci_print_aer(pdev, fatal ? AER_FATAL : AER_NONFATAL, &aer_regs);
>>>> +
>>>> + pci_aer_clear_nonfatal_status(pdev);
>>>> + pci_aer_clear_fatal_status(pdev);
>>>> +
>>>> + return aer_regs.uncor_status & ~aer_regs.uncor_mask;
>>>
>>> Is the above doing anything that pcie_do_recovery() is not doing?
>>>
>>> For example, there are zero endpoint drivers in the kernel currently
>>> calling pci_print_aer(), why is cxl_pci special?
>>>
>>
>> Device AER status is typically logged in the AER driver but is not for Upstream
>> Switch Ports and Endpoints with UCE fatal errors. The call to pci_print_aer()
>> is added to provide details about the error. The logic behind the change is if the
>> AER status is already accessed here (to be cleared) it can be logged to the user.
>> pci_print_aer() is used to maintain consistency in logging.
>
> But cxl_pci is only attached to generic CXL memory expander endpoints.
> What about other CXL devices, what about other PCI devices that would
> benefit from general AER core driver handling.
>
> If the driver is not going to do anything be generally log the error
> that feels like a PCI core responsibility.
>
> What am I missing?
>

The USP case needs a PCIe UCE handler added.

CE are cleared by the AER driver. UCE are not cleared by the AER driver and is left to
the device drivers' handlers to clear.

>>>> @@ -441,7 +448,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>>> }
>>>
>>> I am getting the sense none of this is required. When the device is
>>> trained in CXL the bulk of errors will be CXL port errors, when the
>>> device is not trained in CXL it will not be in use. For that remaining
>>> case, why does the driver need error handlers?
>>
>> Uncorrectable errors of CXL Endpoints and Upstream Switch Ports are handled as
>> PCIe AER and not CXL RAS. This is because the AER status is not read in the AER driver.
>> This is not read in the AER driver because the links are assumed to be down.
>
> Why does the cxl_pci driver not also assume that the links are down?
>

I took a best effort during the fatal UCE. It is calling panic after this.

- Terry

>> The AER driver directs the error to be handled by the PCI path instead of CXL
>> handlers.
>
> Right, becuase a CXL link is not found. At that point is just like any
> other PCI device and unless it has some specific recovery action to take
> it does not need to register a handler.