Re: [PATCH v7 1/2] acpi/ghes, cxl/pci: Process CXL CPER Protocol Errors
From: Ira Weiny
Date: Tue Mar 04 2025 - 13:10:18 EST
Smita Koralahalli wrote:
> When PCIe AER is in FW-First, OS should process CXL Protocol errors from
> CPER records. Introduce support for handling and logging CXL Protocol
> errors.
>
> The defined trace events cxl_aer_uncorrectable_error and
> cxl_aer_correctable_error trace native CXL AER endpoint errors. Reuse them
> to trace FW-First Protocol errors.
>
> Since the CXL code is required to be called from process context and
> GHES is in interrupt context, use workqueues for processing.
>
> Similar to CXL CPER event handling, use kfifo to handle errors as it
> simplifies queue processing by providing lock free fifo operations.
>
> Add the ability for the CXL sub-system to register a workqueue to
> process CXL CPER protocol errors.
>
> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
> ---
> Comments: There is a potential failure case, and I am seeking feedback.
>
> If a CXL Protocol Error occurs during boot: Both acpi_ghes_init() and
> cxl_core_init() are subsys_initcall. GHES might detect the error and
> trigger cxl_cper_post_prot_err() even before CXL device is completely
> enumerated. (i.e pdev might return NULL OR pdev might succeed and cxlds
> might be NULL as cxl_pci driver is not loaded.)
>
I don't think this is something we should be overly concerned about.
If protocol errors are occurring that early then they are very likely to
be bad hardware which is going to happen once the subsystems are brought
up and start working with the devices. So new errors will alert the user.
> Usage of delayed_workqueue(): Would delaying the handling/logging of
> errors, particularly uncorrectable errors, be acceptable?
> Any alternative suggestions for addressing this issue would be greatly
> appreciated.
>
> Tony questioned choosing value 8 for FIFO_DEPTH in v6. That was just a
> random value that I picked. I would appreciate any suggestions in
> considering the appropriate value for number of entries.
FWIW I think this is fine until someone sees a reason to increase it.
[snip]
> +
> +static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
> + struct cxl_ras_capability_regs ras_cap)
> +{
> + u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
> + struct cxl_dev_state *cxlds;
> +
> + cxlds = pci_get_drvdata(pdev);
> + if (!cxlds)
> + return;
> +
> + trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
I dug into this just a bit wondering if passing cxl_memdev is the best way
for this tracepoint to work given the type 2 work...
+ Alejandro
For now I think this patch is fine. So.
Reviewed-by: Ira Weiny <ira.weiny@xxxxxxxxx>
[snip]