Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler

From: Dan Williams

Date: Sun Mar 29 2026 - 21:22:36 EST


Terry Bowman wrote:
> CXL drivers now implement protocol RAS support. PCI protocol errors,
> however, continue to be reported via the AER capability and must still be
> handled by a PCI error recovery callback.
>
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
> new cxl_pci_error_detected() implementation that handles uncorrectable
> AER PCI protocol errors. Changes for PCI Correctable protocol errors will
> be added in a future patch.
>
> Introduce function cxl_uncor_aer_present() to handle and log the CXL
> Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
> the AER driver and require logging here with a call to pci_print_aer().
>
> This cleanly separates CXL protocol error handling from PCI AER handling
> and ensures that each subsystem processes only the errors it is
> responsible.
>
> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> Assisted-by: Azure:gpt4.1-nano-key

This is not a human. If you want a tool to do some of the labor of the
patch I as a reviewer want to know what labor is performed. This tag in
this case tells me nothing actionable about what needs reviewing.

See: Documentation/process/generated-content.rst

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 254144d19764..884e40c66638 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -393,34 +393,41 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state)
> +static bool cxl_uncor_aer_present(struct pci_dev *pdev)
> {
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
> - struct device *dev = &cxlmd->dev;
> - bool ue;
> -
> - scoped_guard(device, dev) {
> - if (!dev->driver) {
> - dev_warn(&pdev->dev,
> - "%s: memdev disabled, abort error handling\n",
> - dev_name(dev));
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> + struct aer_capability_regs aer_regs;
> + u32 fatal, aer_cap = pdev->aer_cap;
>
> - if (cxlds->rcd)
> - cxl_handle_rdport_errors(pdev);
> - /*
> - * A frozen channel indicates an impending reset which is fatal to
> - * CXL.mem operation, and will likely crash the system. On the off
> - * chance the situation is recoverable dump the status of the RAS
> - * capability registers and bounce the active state of the memdev.
> - */
> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
> - cxlmd->endpoint->regs.ras);
> + if (!aer_cap) {
> + pr_warn_ratelimited("%s: AER capability isn't present\n",
> + pci_name(pdev));
> + return false;

How did the PCI core generate the error if this capability is not
present?

> }
>
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS,
> + &aer_regs.uncor_status);
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK,
> + &aer_regs.uncor_mask);
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_SEVER,
> + &aer_regs.uncor_severity);
> +
> + fatal = (aer_regs.uncor_severity & aer_regs.uncor_severity);
> + pci_print_aer(pdev, fatal ? AER_FATAL : AER_NONFATAL, &aer_regs);
> +
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> +
> + return aer_regs.uncor_status & ~aer_regs.uncor_mask;

Is the above doing anything that pcie_do_recovery() is not doing?

For example, there are zero endpoint drivers in the kernel currently
calling pci_print_aer(), why is cxl_pci special?

> +}
> +
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> +{
> + bool ue = cxl_uncor_aer_present(pdev);
> + struct cxl_port *port = get_cxl_port(pdev);
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct device *dev = &cxlmd->dev;
> +
> switch (state) {
> case pci_channel_io_normal:
> if (ue) {
> @@ -441,7 +448,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> }

I am getting the sense none of this is required. When the device is
trained in CXL the bulk of errors will be CXL port errors, when the
device is not trained in CXL it will not be in use. For that remaining
case, why does the driver need error handlers?