Re: [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow
From: Jonathan Cameron
Date: Thu May 07 2026 - 14:22:39 EST
On Tue, 5 May 2026 12:30:24 -0500
Terry Bowman <terry.bowman@xxxxxxx> wrote:
> Add CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints. Establish a
> common flow for correctable and uncorrectable CXL protocol errors.
> RCH Downstream Port error handling is added in a following patch.
>
> Add cxl_handle_proto_error() to dispatch correctable and uncorrectable
> errors through the CXL RAS helpers. Add cxl_do_recovery() to coordinate
> uncorrectable recovery. Panic via panic() on any uncorrectable CXL RAS
> error. CXL.cachemem traffic cannot be safely recovered from an
> uncorrectable protocol error in software, so panic regardless of the
> AER severity reported. Gate error handling on the port driver being
> bound to avoid processing errors on disabled devices.
>
> Panic explicitly on pci_dev_is_disconnected() before accessing the RAS
> registers. A CXL device disconnecting during an uncorrectable error event
> is itself unrecoverable, particularly for devices in interleaved HDM
> regions. Relying on the status readl() returning ~0u to trip the existing
> panic path leaves the cause ambiguous.
>
> The panic policy applies to the RAS register block of the device whose
> error triggered the recovery: Root/Downstream Port RAS for VH Ports,
> Endpoint Port RAS for VH Endpoints and RCDs. Upstream RCH Downstream
> Port RAS UEs handled via cxl_handle_rdport_errors() are logged only, as
> before this series. Only the RCD Endpoint's own RAS UE drives the panic.
>
> Add to_ras_base() to centralize the RAS base lookup. It selects
> dport->regs.ras for Root/Downstream Ports and port->regs.ras for
> Upstream Ports and Endpoints.
>
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() so
> cxl_core can clear PCIe/AER state during recovery.
>
> Wire the AER core to the kfifo in this commit by adding the
> is_cxl_error() switch in handle_error_source() alongside the consumer
> registration. This way the producer and consumer go live in the same
> commit, so CXL errors are not silently dropped during bisect.
>
> The correctable AER status is cleared by the producer in
> cxl_forward_error().
>
> Co-developed-by: Dan Williams <djbw@xxxxxxxxxx>
> Signed-off-by: Dan Williams <djbw@xxxxxxxxxx>
> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
>
A few trivial things inline. With those tidied up
Reviewed-by: Jonathan Cameron <jic23@xxxxxxxxxx>
> + * find_cxl_port_by_dev - Use @dev as hint to do a _by_dport or _by_uport lookup
> + * @dev: generic device that may either be a companion of port or target dport
> + * @dport: output parameter; set to the matched dport for dport-class
> + * lookups (Root Port, Downstream Port), NULL otherwise.
> + *
> + * Return a 'struct cxl_port' with an elevated reference if found. Use
> + * __free(put_cxl_port) to release.
> + */
> +static struct cxl_port *find_cxl_port_by_dev(struct device *dev, struct cxl_dport **dport)
> +{
> + struct pci_dev *pdev;
> +
> + *dport = NULL;
> + if (!dev_is_pci(dev))
> + return NULL;
> +
> + pdev = to_pci_dev(dev);
Only used once, so there is little point in keeping this local variable...
> +
> + switch (pci_pcie_type(pdev)) {
switch (pci_pcie_type(to_pci_dev(dev))) {
looks readable enough to me.
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + return find_cxl_port_by_dport(dev, dport);
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + case PCI_EXP_TYPE_RC_END:
> + return find_cxl_port_by_uport(dev);
> + }
> +
> + return NULL;
> +}
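To illustrate the shape I mean, here is a minimal userspace sketch of the same dispatch with the accessor call placed directly in the switch. Everything prefixed mock_/MOCK_ is a hypothetical stand-in and not kernel API; the enum values are only assumed to mirror the PCI_EXP_TYPE_* encodings, and the real lookups (find_cxl_port_by_dport() / find_cxl_port_by_uport()) are reduced to a class tag.

```c
#include <stddef.h>

/* Hypothetical stand-ins; values assumed to mirror PCI_EXP_TYPE_*. */
enum mock_pcie_type {
	MOCK_TYPE_ENDPOINT   = 0x0,
	MOCK_TYPE_ROOT_PORT  = 0x4,
	MOCK_TYPE_UPSTREAM   = 0x5,
	MOCK_TYPE_DOWNSTREAM = 0x6,
	MOCK_TYPE_RC_END     = 0x9,
	MOCK_TYPE_RC_EC      = 0xa,
};

/* Which kind of lookup the real code would perform. */
enum mock_lookup {
	MOCK_LOOKUP_NONE,
	MOCK_LOOKUP_BY_DPORT,
	MOCK_LOOKUP_BY_UPORT,
};

struct mock_dev {
	int is_pci;
	enum mock_pcie_type type;
};

/* Stand-in for pci_pcie_type(to_pci_dev(dev)). */
static enum mock_pcie_type mock_pcie_type_of(const struct mock_dev *dev)
{
	return dev->type;
}

static enum mock_lookup mock_find_port_class(const struct mock_dev *dev)
{
	if (!dev->is_pci)
		return MOCK_LOOKUP_NONE;

	/* Accessor call sits in the switch; no single-use local needed. */
	switch (mock_pcie_type_of(dev)) {
	case MOCK_TYPE_ROOT_PORT:
	case MOCK_TYPE_DOWNSTREAM:
		return MOCK_LOOKUP_BY_DPORT;
	case MOCK_TYPE_UPSTREAM:
	case MOCK_TYPE_ENDPOINT:
	case MOCK_TYPE_RC_END:
		return MOCK_LOOKUP_BY_UPORT;
	default:
		return MOCK_LOOKUP_NONE;
	}
}
```

The behaviour is unchanged from the quoted version; only the temporary goes away. Any type not listed (for example an RC event collector) still falls through to the "not found" result.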
> +
> +static void cxl_do_recovery(struct pci_dev *pdev, struct cxl_port *port, struct cxl_dport *dport)
> +{
> + struct device *dev = &pdev->dev;
> + bool ue;
> +
> + if (pci_dev_is_disconnected(pdev))
> + panic("CXL cachemem error: device disconnected during UE recovery");
> +
> + ue = cxl_handle_ras(dev, pci_get_dsn(pdev),
> + to_ras_base(port, dport));
I haven't checked (lazy, or maybe busy), but if this stays the same for the
rest of the series, it fits on one line of around 78 characters.
> + if (ue)
> + panic("CXL cachemem error.");
> +
> + pcie_clear_device_status(pdev);
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> +}
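For anyone following along, the ordering argument from the commit message can be sketched in userspace as below. This is only an illustration under stated assumptions: the mock_ types and functions are hypothetical, panic() is replaced by a returned verdict, and the only hardware behaviour modelled is that a read from a disconnected device returns all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a CXL device and its RAS UE status. */
struct mock_cxl_dev {
	bool disconnected;
	uint32_t ras_uc_status;
};

/* Verdicts stand in for the two distinct panic() messages. */
enum mock_verdict {
	MOCK_OK,
	MOCK_PANIC_DISCONNECTED,
	MOCK_PANIC_UE,
};

/* Models readl() on the status register: all ones when the device is gone. */
static uint32_t mock_read_status(const struct mock_cxl_dev *d)
{
	return d->disconnected ? ~0u : d->ras_uc_status;
}

static enum mock_verdict mock_do_recovery(const struct mock_cxl_dev *d)
{
	/* Checked first, so the cause is reported unambiguously. */
	if (d->disconnected)
		return MOCK_PANIC_DISCONNECTED;

	/* Any set bit is treated as an uncorrectable error. */
	if (mock_read_status(d))
		return MOCK_PANIC_UE;

	return MOCK_OK;
}
```

Without the first check, the disconnected case would still end in the UE verdict, because ~0u has every status bit set, but the message would not say why.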
> +int cxl_ras_init(void)
> +{
> + cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> + cxl_register_proto_err_work(&cxl_proto_err_work);
> +
> + return 0;
Making this void cxl_ras_init(), as per the earlier suggestion, still looks
good given it can only return 0 ;)
> +}
> +
> +void cxl_ras_exit(void)
> +{
> + cxl_cper_unregister_prot_err_work();
> + cxl_unregister_proto_err_work();
> +}