Re: [PATCH v3 6/6] PCI/AER: Unmask RCEC internal errors to enable RCH downstream port error handling

From: Bjorn Helgaas
Date: Wed Apr 12 2023 - 17:29:09 EST


On Tue, Apr 11, 2023 at 01:03:02PM -0500, Terry Bowman wrote:
> From: Robert Richter <rrichter@xxxxxxx>
>
> RCEC AER corrected and uncorrectable internal errors (CIE/UIE) are
> disabled by default.

"Disabled by default" just means "the power-up state of CIE/UIC is
that they are masked", right? It doesn't mean that Linux normally
masks them.

> [1][2] Enable them to receive CXL downstream port
> errors of a Restricted CXL Host (RCH).
>
> [1] CXL 3.0 Spec, 12.2.1.1 - RCH Downstream Port Detected Errors
> [2] PCIe Base Spec 6.0, 7.8.4.3 Uncorrectable Error Mask Register,
> 7.8.4.6 Correctable Error Mask Register
>
> Co-developed-by: Terry Bowman <terry.bowman@xxxxxxx>
> Signed-off-by: Robert Richter <rrichter@xxxxxxx>
> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> Cc: "Oliver O'Halloran" <oohall@xxxxxxxxx>
> Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> Cc: Mahesh J Salgaonkar <mahesh@xxxxxxxxxxxxx>
> Cc: linuxppc-dev@xxxxxxxxxxxxxxxx
> Cc: linux-pci@xxxxxxxxxxxxxxx
> ---
> drivers/pci/pcie/aer.c | 73 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 73 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 171a08fd8ebd..3973c731e11d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1000,7 +1000,79 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> pcie_walk_rcec(dev, cxl_handle_error_iter, info);
> }
>
> +static bool cxl_error_is_native(struct pci_dev *dev)
> +{
> + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +
> + if (pcie_ports_native)
> + return true;
> +
> + return host->native_aer && host->native_cxl_error;
> +}
> +
> +static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> +{
> + int *handles_cxl = data;
> +
> + *handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> +
> + return *handles_cxl;
> +}
> +
> +static bool handles_cxl_errors(struct pci_dev *rcec)
> +{
> + int handles_cxl = 0;
> +
> + if (!rcec->aer_cap)
> + return false;
> +
> + if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC)
> + pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> +
> + return !!handles_cxl;
> +}
> +
> +static int __cxl_unmask_internal_errors(struct pci_dev *rcec)
> +{
> + int aer, rc;
> + u32 mask;
> +
> + /*
> + * Internal errors are masked by default, unmask RCEC's here
> + * PCI6.0 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h)
> + * PCI6.0 7.8.4.6 Correctable Error Mask Register (Offset 14h)
> + */

Unmasking internal errors doesn't have anything specific to do with
CXL, so I don't think it should have "cxl" in the function name.
Maybe something like "pci_aer_unmask_internal_errors()".

This also has nothing special to do with RCECs, so I think we should
refer to the device as "dev" as is typical in this file.

I think this needs to check pcie_aer_is_native() as is done by
pci_aer_clear_nonfatal_status() and other functions that write the AER
Capability.

With the exception of this function, this patch looks like all CXL
code that maybe could be with other CXL code. Would require making
pcie_walk_rcec() available outside drivers/pci, I guess.

> + aer = rcec->aer_cap;
> + rc = pci_read_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, &mask);
> + if (rc)
> + return rc;
> + mask &= ~PCI_ERR_UNC_INTN;
> + rc = pci_write_config_dword(rcec, aer + PCI_ERR_UNCOR_MASK, mask);
> + if (rc)
> + return rc;
> +
> + rc = pci_read_config_dword(rcec, aer + PCI_ERR_COR_MASK, &mask);
> + if (rc)
> + return rc;
> + mask &= ~PCI_ERR_COR_INTERNAL;
> + rc = pci_write_config_dword(rcec, aer + PCI_ERR_COR_MASK, mask);
> +
> + return rc;
> +}
> +
> +static void cxl_unmask_internal_errors(struct pci_dev *rcec)
> +{
> + if (!handles_cxl_errors(rcec))
> + return;
> +
> + if (__cxl_unmask_internal_errors(rcec))
> + dev_err(&rcec->dev, "cxl: Failed to unmask internal errors");
> + else
> + dev_dbg(&rcec->dev, "cxl: Internal errors unmasked");
> +}
> +
> #else
> +static inline void cxl_unmask_internal_errors(struct pci_dev *dev) { }
> static inline void cxl_handle_error(struct pci_dev *dev,
> struct aer_err_info *info) { }
> #endif
> @@ -1397,6 +1469,7 @@ static int aer_probe(struct pcie_device *dev)
> return status;
> }
>
> + cxl_unmask_internal_errors(port);
> aer_enable_rootport(rpc);
> pci_info(port, "enabled with IRQ %d\n", dev->irq);
> return 0;
> --
> 2.34.1
>