Re: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
From: Jonathan Cameron
Date: Fri Apr 04 2025 - 13:04:44 EST
On Wed, 26 Mar 2025 20:47:17 -0500
Terry Bowman <terry.bowman@xxxxxxx> wrote:
> During CXL device cleanup the CXL PCIe Port device interrupts may remain
> enabled. This can potentialy allow unnecessary interrupt processing on
> behalf of the CXL errors while the device is destroyed.
>
> Disable CXL protocol errors by setting the CXL devices' AER mask register.
>
> Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors().
>
> Next, introduce cxl_disable_prot_errors() to call pci_aer_mask_internal_errors().
> Register cxl_disable_prot_errors() to run at CXL device cleanup.
> Register for CXL Root Ports, CXL Downstream Ports, CXL Upstream Ports, and
> CXL Endpoints.
>
> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
A few small comments in here. I haven't looked through all the rest of the series
as out of time today but this one caught my eye.
>
> @@ -223,7 +238,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
> struct device *cxlmd_dev __free(put_device) = &cxlmd->dev;
> struct cxl_dev_state *cxlds = cxlmd->cxlds;
>
> - if (!dport || !dev_is_pci(dport->dport_dev)) {
> + if (!dport || !dev_is_pci(dport->dport_dev) || !dev_is_pci(cxlds->dev)) {
> dev_err(&port->dev, "CXL port topology not found\n");
> return;
> }
> @@ -232,6 +247,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>
> cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers);
> cxl_enable_prot_errors(cxlds->dev);
> + devm_add_action_or_reset(cxlds->dev, cxl_disable_prot_errors, cxlds->dev);
This can fail (at least in theory). Should at least scream that oddly we've
disabled error handling interrupts if it is hard to return anything cleanly.
Same for all the other cases.
> }
>
> #else
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d3068f5cc767..d1ef0c676ff8 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> }
> EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>
> +/**
> + * pci_aer_mask_internal_errors - mask internal errors
> + * @dev: pointer to the pcie_dev data structure
> + *
> + * Masks internal errors in the Uncorrectable and Correctable Error
> + * Mask registers.
> + *
> + * Note: AER must be enabled and supported by the device which must be
> + * checked in advance, e.g. with pcie_aer_is_native().
> + */
> +void pci_aer_mask_internal_errors(struct pci_dev *dev)
> +{
> + int aer = dev->aer_cap;
> + u32 mask;
> +
> + pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
> + mask |= PCI_ERR_UNC_INTN;
> + pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
> +
It does an extra clear we don't need, but....
pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
0, PCI_ERR_UNC_INTN);
is at very least shorter than the above 3 lines.
> + pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
> + mask |= PCI_ERR_COR_INTERNAL;
> + pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> +}
> +EXPORT_SYMBOL_NS_GPL(pci_aer_mask_internal_errors, "CXL");
> +
> static bool is_cxl_mem_dev(struct pci_dev *dev)
> {
> /*
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index a65fe324fad2..f0c84db466e5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -101,5 +101,6 @@ int cper_severity_to_aer(int cper_severity);
> void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> int severity, struct aer_capability_regs *aer_regs);
> void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +void pci_aer_mask_internal_errors(struct pci_dev *dev);
> #endif //_AER_H_
>