Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler
From: Jonathan Cameron
Date: Mon Mar 09 2026 - 10:14:08 EST
On Mon, 2 Mar 2026 14:36:46 -0600
Terry Bowman <terry.bowman@xxxxxxx> wrote:
> CXL drivers now implement protocol RAS support. PCI protocol errors,
> however, continue to be reported via the AER capability and must still be
> handled by a PCI error recovery callback.
>
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
> new cxl_pci_error_detected() implementation that handles uncorrectable
> AER PCI protocol errors. Changes for PCI Correctable protocol errors will
> be added in a future patch.
>
> Introduce function cxl_uncor_aer_present() to handle and log the CXL
> Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
> the AER driver and require logging here with a call to pci_print_aer().
>
> This cleanly separates CXL protocol error handling from PCI AER handling
> and ensures that each subsystem processes only the errors it is
> responsible for.
>
> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> Assisted-by: Azure:gpt4.1-nano-key
One question inline.
>
> ---
>
> Changes in v15->v16:
> - Update commit message (DaveJ)
> - s/cxl_handle_aer()/cxl_uncor_aer_present()/g (Jonathan)
> - cxl_uncor_aer_present(): Leave original result calculation based on
> if a UCE is present and the provided state (Terry)
> - Add call to pci_print_aer(). AER fails to log because is upstream
> link (Terry)
>
> Changes in v14->v15:
> - Update commit message and title. Added Bjorn's ack.
> - Move CE and UCE handling logic here
>
> Changes in v13->v14:
> - Add Dave Jiang's review-by
> - Update commit message & headline (Bjorn)
> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> one line (Jonathan)
> - Remove cxl_walk_port() (Dan)
> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> sufficient (Dan)
> - Remove device_lock_if()
> - Combined CE and UCE here (Terry)
>
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
>
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> port searches.
>
> Changes in v10->v11:
> - None
> ---
> drivers/cxl/core/ras.c | 57 ++++++++++++++++++++++++------------------
> drivers/cxl/cxlpci.h | 9 +++----
> drivers/cxl/pci.c | 6 ++---
> 3 files changed, 39 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 254144d19764..884e40c66638 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
...
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> +{
> + bool ue = cxl_uncor_aer_present(pdev);
> + struct cxl_port *port = get_cxl_port(pdev);
This takes a reference that wasn't (I think) previously taken.
I'm not spotting where it is released. If that happens somewhere beyond
this function, it would be good to add a comment saying where.
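
To illustrate the concern, here is a minimal userspace sketch, not kernel
code. It assumes get_cxl_port() returns the port with an elevated reference
count, which I have not confirmed from the rest of the series. struct
mock_port, mock_get_port() and mock_put_port() are made-up stand-ins that
only model the counting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cxl_port; only the refcount is modelled. */
struct mock_port {
	int refcount;
};

/* Models a lookup that hands back the port with an extra reference held. */
static struct mock_port *mock_get_port(struct mock_port *port)
{
	if (port)
		port->refcount++;
	return port;
}

/* Drops the reference taken by mock_get_port(). */
static void mock_put_port(struct mock_port *port)
{
	if (!port)
		return;
	assert(port->refcount > 0);
	port->refcount--;
}

/* Shape of the handler as posted: takes a reference and never drops it. */
static int handler_leaky(struct mock_port *registered)
{
	struct mock_port *port = mock_get_port(registered);

	return port ? 1 : 0;
}

/* Same handler with the reference released before returning. */
static int handler_balanced(struct mock_port *registered)
{
	struct mock_port *port = mock_get_port(registered);
	int ret = port ? 1 : 0;

	mock_put_port(port);
	return ret;
}
```

With a port that starts at one reference, every call to handler_leaky()
leaves the count one higher, while handler_balanced() leaves it unchanged.
In the real handler the matching put (or a scope-based cleanup helper, if
the series provides one for ports) would play the role of mock_put_port().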
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct device *dev = &cxlmd->dev;
> +
> switch (state) {
> case pci_channel_io_normal:
> if (ue) {
> @@ -441,7 +448,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> }
> return PCI_ERS_RESULT_NEED_RESET;
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");