Re: [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers

From: Bowman, Terry

Date: Wed Mar 11 2026 - 11:23:24 EST

On 3/9/2026 9:00 AM, Jonathan Cameron wrote:
> On Mon, 2 Mar 2026 14:36:44 -0600
> Terry Bowman <terry.bowman@xxxxxxx> wrote:
>
>> Restricted CXL Host (RCH) error handling is not currently supported by the
>> CXL Port error handling flow. Integrate the existing RCH error handling
>> into the new Port error handling.
>>
>> Update cxl_rch_handle_error_iter() to forward the RCH protocol error using
>> the AER-CXL kfifo.
>>
>> Update cxl_handle_proto_error() to begin the RCH error handling with a call
>> to cxl_handle_rdport_errors(). This function handles both correctable and
>> uncorrectable RCH protocol errors.
>>
>> Change the cxl_handle_rdport_errors() function parameter from a CXL device
>> state to a PCI device.
>>
>> Report the serial number of the RCD Endpoint in the RCH logging. This
>> is used to associate the RCH with the RCD in the logs.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@xxxxxxx>
> One question inline.
>
> + a comment on a bit of neighboring code.
>
> J
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 1d4be2d78469..48d3ef7cbb92 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>
>> static void cxl_handle_proto_error(struct pci_dev *pdev, int severity)
>> {
>> + /*
>> + * CXL RCD's AER error interrupt is used for reporting RCD and RCH
>> + * Downstream Port protocol errors. RCH protocol errors are handled
>> + * using a unique procedure separate from CXL Port devices.
>> + * See CXL spec r4.0, 12.2 CXL Error Handling
>> + */
>> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
>> + w
>
> Maybe I'm missing something but why do we want to carry on running the rest
> of this function after this? Superficially seems like we will be doing
> at least some stuff that didn't happen before.
>

Hi Jonathan,

Before this series, a CXL RCiEP's internal AER interrupt resulted in calls to
both the RCH handler (cxl_handle_proto_error()) and the EP handler.
The EP handler looked like this:

    scoped_guard(device, dev) {
        if (!dev->driver) {
            dev_warn(&pdev->dev,
                     "%s: memdev disabled, abort error handling\n",
                     dev_name(dev));
            return PCI_ERS_RESULT_DISCONNECT;
        }

        if (cxlds->rcd)
            cxl_handle_rdport_errors(cxlds);    <== RCH handling
        /*
         * A frozen channel indicates an impending reset which is fatal to
         * CXL.mem operation, and will likely crash the system. On the off
         * chance the situation is recoverable dump the status of the RAS
         * capability registers and bounce the active state of the memdev.
         */
        ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);    <== EP handling
    }
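
To make the ordering in the snippet above explicit, here is a minimal
user-space sketch. It only models the control flow: when the device is an
RCD, the RCH Downstream Port handling runs first, and the endpoint RAS
handling then runs unconditionally. Every name in it (struct fake_cxlds,
fake_handle_rdport_errors(), fake_handle_ras(), fake_ep_error_detected()) is
a hypothetical stand-in and not a kernel type or function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the CXL device state; not the kernel struct. */
struct fake_cxlds {
    bool rcd;           /* true when the endpoint is an RCD behind an RCH */
    int rch_calls;      /* how many times the RCH handling ran */
    int ep_calls;       /* how many times the endpoint handling ran */
};

/* Models the RCH Downstream Port error handling step. */
static void fake_handle_rdport_errors(struct fake_cxlds *cxlds)
{
    cxlds->rch_calls++;
}

/* Models the endpoint RAS handling step. */
static void fake_handle_ras(struct fake_cxlds *cxlds)
{
    cxlds->ep_calls++;
}

/*
 * Same ordering as the snippet above: the RCH handling is conditional on
 * the device being an RCD, and the endpoint handling follows in all cases.
 */
static void fake_ep_error_detected(struct fake_cxlds *cxlds)
{
    if (cxlds->rcd)
        fake_handle_rdport_errors(cxlds);

    fake_handle_ras(cxlds);
}
```

In this model an RCD gets both steps, while a non-RCD endpoint gets only the
endpoint step, so the endpoint handling was never skipped after the RCH
handling.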

-Terry

>> +
>> if (severity == AER_CORRECTABLE) {
>> struct device *dev = &pdev->dev;
>
>> diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
>> index e471eefec9c4..83142eac0cab 100644
>> --- a/drivers/pci/pcie/aer_cxl_rch.c
>> +++ b/drivers/pci/pcie/aer_cxl_rch.c
>> @@ -37,26 +37,11 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>> static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> {
>> struct aer_err_info *info = (struct aer_err_info *)data;
>
> Not related to this patch but that cast isn't needed.
>
>> - const struct pci_error_handlers *err_handler;
>>