Re: [PATCH v4 7/7] cxl/memdev: Register for and process CPER events

From: Dan Williams
Date: Mon Dec 18 2023 - 15:57:20 EST


Smita Koralahalli wrote:
> On 12/15/2023 3:26 PM, Ira Weiny wrote:
> > If the firmware has configured CXL event support to be firmware first
> > the OS can process those events through CPER records. The CXL layer has
> > unique DPA to HPA knowledge and standard event trace parsing in place.
> >
> > CPER records contain Bus, Device, Function information which can be used
> > to identify the PCI device which is sending the event.
> >
> > Change pci driver registration to include registration for a CXL CPER
> > notifier to process the events through the trace subsystem.
> >
> > Define and use scoped based management to simplify the handling of the
> > pci device object.
> >
> > Cc: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> > Signed-off-by: Ira Weiny <ira.weiny@xxxxxxxxx>
> >
> > ---
>
> [snip]
>
>
> > + switch (event_type) {
> > + case CXL_CPER_EVENT_GEN_MEDIA:
> > + trace_cxl_general_media(cxlmd, type, &gen_media_event_uuid,
> > + &event->gen_media);
> > + break;
> > + case CXL_CPER_EVENT_DRAM:
> > + trace_cxl_dram(cxlmd, type, &dram_event_uuid, &event->dram);
> > + break;
> > + case CXL_CPER_EVENT_MEM_MODULE:
> > + trace_cxl_memory_module(cxlmd, type, &mem_mod_event_uuid,
> > + &event->mem_module);
> > + break;
> > + }
> > +}
>
> Is default case needed here?

Yeah, it looks like an uninitialized @type value can be passed through
the stack here.

[..]
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 0155fb66b580..638275569d63 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -1,5 +1,6 @@
> > // SPDX-License-Identifier: GPL-2.0-only
> > /* Copyright(c) 2020 Intel Corporation. All rights reserved. */
> > +#include <asm-generic/unaligned.h>
> > #include <linux/io-64-nonatomic-lo-hi.h>
> > #include <linux/moduleparam.h>
> > #include <linux/module.h>
> > @@ -969,6 +970,58 @@ static struct pci_driver cxl_pci_driver = {
> > },
> > };
> >
> > +#define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
> > +static void cxl_cper_event_call(enum cxl_event_type ev_type,
> > + struct cxl_cper_event_rec *rec)
> > +{
> > + struct cper_cxl_event_devid *device_id = &rec->hdr.device_id;
> > + struct pci_dev *pdev __free(pci_dev_put) = NULL;
> > + struct cxl_dev_state *cxlds = NULL;
> > + enum cxl_event_log_type log_type;
> > + unsigned int devfn;
> > + u32 hdr_flags;
> > +
> > + devfn = PCI_DEVFN(device_id->device_num, device_id->func_num);
> > + pdev = pci_get_domain_bus_and_slot(device_id->segment_num,
> > + device_id->bus_num, devfn);
> > + if (!pdev)
> > + return;
> > +
> > + guard(device)(&pdev->dev);
> > + if (pdev->driver == &cxl_pci_driver)
> > + cxlds = pci_get_drvdata(pdev);
> > + if (!cxlds)
> > + return;
> > +
> > + /* Fabricate a log type */
> > + hdr_flags = get_unaligned_le24(rec->event.generic.hdr.flags);
> > + log_type = FIELD_GET(CXL_EVENT_HDR_FLAGS_REC_SEVERITY, hdr_flags);
> > +
> > + cxl_event_trace_record(cxlds->cxlmd, log_type, ev_type, &rec->event);
>
> Currently, when I run this, I see two trace events printed. One from
> here, and another as a non_standard_event from ghes. I think both should
> be unified?
>
> I remember Dan pointing out to me this when I sent decoding for protocol
> errors and its still pending on me for protocol errors.

Good point, so I think the responsibility to trace CXL events should
belong to ghes_do_proc() and ghes_print_estatus() can just ignore CXL
events.

Notice how ghes_proc() sometimes skips ghes_print_estatus(), but
uncoditionally emits a trace event in ghes_do_proc()? To me that means
that the cper_estatus_print() inside ghes_print_estatus() can just defer
to the ghes code to do the hookup to the trace code.

For example, ras_userspace_consumers() was introduced to skip emitting
events to the kernel log when the trace event might be handled. My
assumption is that was for historical reasons, but since CXL events are
new, just never emit them to the kernel log and always require the trace
path.

I am open to other thoughts here, but it seems like ghes_do_proc() is
where the callback needs to be triggered.


> Thanks,
> Smita
>
> > +}
> > +
> > +static int __init cxl_pci_driver_init(void)
> > +{
> > + int rc;
> > +
> > + rc = pci_register_driver(&cxl_pci_driver);
> > + if (rc)
> > + return rc;
> > +
> > + rc = cxl_cper_register_notifier(cxl_cper_event_call);

Quick aside as I am reading through this, the "notifier" name is
misleading since this callback has nothing to do with the
include/linux/notifier.h API.