Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler

From: Bowman, Terry

Date: Tue Mar 31 2026 - 14:53:17 EST

On 3/29/2026 8:22 PM, Dan Williams wrote:
> Terry Bowman wrote:
>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>> however, continue to be reported via the AER capability and must still be
>> handled by a PCI error recovery callback.
>>
>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>> new cxl_pci_error_detected() implementation that handles uncorrectable
>> AER PCI protocol errors. Changes for PCI Correctable protocol errors will
>> be added in a future patch.
>>
>> Introduce function cxl_uncor_aer_present() to handle and log the CXL
>> Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
>> the AER driver and require logging here with a call to pci_print_aer().
>>
>> This cleanly separates CXL protocol error handling from PCI AER handling
>> and ensures that each subsystem processes only the errors it is
>> responsible.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>> Assisted-by: Azure:gpt4.1-nano-key
>
> This is not a human. If you want a tool to do some of the labor of the
> patch I as a reviewer want to know what labor is performed. This tag in
> this case tells me nothing actionable about what needs reviewing.
>
> See: Documentation/process/generated-content.rst
>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 254144d19764..884e40c66638 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -393,34 +393,41 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>
>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> - pci_channel_state_t state)
>> +static bool cxl_uncor_aer_present(struct pci_dev *pdev)
>> {
>> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
>> - struct device *dev = &cxlmd->dev;
>> - bool ue;
>> -
>> - scoped_guard(device, dev) {
>> - if (!dev->driver) {
>> - dev_warn(&pdev->dev,
>> - "%s: memdev disabled, abort error handling\n",
>> - dev_name(dev));
>> - return PCI_ERS_RESULT_DISCONNECT;
>> - }
>> + struct aer_capability_regs aer_regs;
>> + u32 fatal, aer_cap = pdev->aer_cap;
>>
>> - if (cxlds->rcd)
>> - cxl_handle_rdport_errors(pdev);
>> - /*
>> - * A frozen channel indicates an impending reset which is fatal to
>> - * CXL.mem operation, and will likely crash the system. On the off
>> - * chance the situation is recoverable dump the status of the RAS
>> - * capability registers and bounce the active state of the memdev.
>> - */
>> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
>> - cxlmd->endpoint->regs.ras);
>> + if (!aer_cap) {
>> + pr_warn_ratelimited("%s: AER capability isn't present\n",
>> + pci_name(pdev));
>> + return false;
>
> How did the PCI core generate the error if this capability is not
> present?
>

This was taken from the AER driver likely for GHES support. The aer_cap check is not
needed here and an be removed.

>> }
>>
>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS,
>> + &aer_regs.uncor_status);
>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK,
>> + &aer_regs.uncor_mask);
>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_SEVER,
>> + &aer_regs.uncor_severity);
>> +
>> + fatal = (aer_regs.uncor_severity & aer_regs.uncor_severity);
>> + pci_print_aer(pdev, fatal ? AER_FATAL : AER_NONFATAL, &aer_regs);
>> +
>> + pci_aer_clear_nonfatal_status(pdev);
>> + pci_aer_clear_fatal_status(pdev);
>> +
>> + return aer_regs.uncor_status & ~aer_regs.uncor_mask;
>
> Is the above doing anything that pcie_do_recovery() is not doing?
>
> For example, there are zero endpoint drivers in the kernel currently
> calling pci_print_aer(), why is cxl_pci special?
>

Device AER status is typically logged in the AER driver but is not for Upstream
Switch Ports and Endpoints with UCE fatal errors. The call to pci_print_aer()
is added to provide details about the error. The logic behind the change is if the
AER status is already accessed here (to be cleared) it can be logged to the user.
pci_print_aer() is used to maintain consistency in logging.

>> +}
>> +
>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> + pci_channel_state_t state)
>> +{
>> + bool ue = cxl_uncor_aer_present(pdev);
>> + struct cxl_port *port = get_cxl_port(pdev);
>> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> + struct device *dev = &cxlmd->dev;
>> +
>> switch (state) {
>> case pci_channel_io_normal:
>> if (ue) {
>> @@ -441,7 +448,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> }
>
> I am getting the sense none of this is required. When the device is
> trained in CXL the bulk of errors will be CXL port errors, when the
> device is not trained in CXL it will not be in use. For that remaining
> case, why does the driver need error handlers?

Uncorrectable errors of CXL Endpoints and Upstream Switch Ports are handled as
PCIe AER and not CXL RAS. This is because the AER status is not read in the AER driver.
This is not read in the AER driver because the links are assumed to be down.
The AER driver directs the error to be handled by the PCI path instead of CXL
handlers.

- Terry