RE: [PATCH 0/3] CXL Poison List Retrieval & Tracing

From: Dan Williams
Date: Fri Jun 17 2022 - 13:57:42 EST


alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@xxxxxxxxx>
>
> Introducing the first piece of support for CXL Media Errors,
> offering the ability to retrieve a devices poison list and
> store the returned error records as kernel trace events.
>
> The handling of the poison list is guided by the CXL 2.0 Spec
> Section 8.2.9.5.4.1. [1] The usage of Trace Events to store the
> Media Error records is a first look at the proposed handling
> of CXL ARS events.

Nice! Thanks for getting this started.

>
> Example command line usage:
>
> $ trace-cmd record -e cxl_poison_list
> $ echo 1 > /sys/bus/cxl/devices/mem1/get_poison

Perhaps call this 'trace_poison_cached', and later call the one that
does scan_media 'trace_poison'? Otherwise it is odd that an attribute
named "get" is write-only and returns nothing but an error code.


> $ trace-cmd report trace.dat
>
> cxl_poison_list: memdev: mem3 source EXTERNAL start 0x41 length 0x2

Perhaps just make the source the raw number value so there is no need to
go change the kernel if that reserved field grows more definitions, or
if userspace grows knowledge of what values 0x4, 0x5, and 0x6 mean for a
given device.

> cxl_poison_list: memdev: mem3 source INTERNAL start 0xc2 length 0x3

I wonder if this wants to eventually be unified into a common format
with the General Media Event record? The idea being that there will be
multiple sources and triggers for CXL subsystem trace events. Lets take
the scenario of poison being written to a device by a DMA agent. In that
case the device may fire a General Media Event with the Uncorrectable
indicator and the DPA. Userspace will want to distinguish that from the
Media Error record, but probably benefits from using similar event
parsing. The trace event could be modified with something like an
"origin" key to say whether the event was asynchronously generated by
the device, or synchronously requested by a given process. Something
like: "origin:<device|pid>".

> cxl_poison_list: memdev: mem3 source INJECTED start 0x183 length 0x4
> cxl_poison_list: memdev: mem3 source INVALID start 0x284 length 0x5
> cxl_poison_list: memdev: mem3 source VENDOR start 0x707 length 0x8

>From a parsing perspective should these be consistent about the usage of
":" to delineate keys vs values? I.e.:

cxl_poison_list: memdev:mem3 source:VENDOR start:0x707 length:0x8

...that way if this format changes tooling will be prepared to tolerate
new keys injected at any position. I have not looked at how
libtraceevent identifies fields and backwards compatibility.

>
> [1]: https://www.computeexpresslink.org/download-the-specification
>
> Alison Schofield (3):
> trace, cxl: Introduce a TRACE_EVENT for CXL Poison Records
> cxl/mbox: Add GET_POISON_LIST mailbox command support
> cxl/core: Add sysfs attribute get_poison for list retrieval
>
> Documentation/ABI/testing/sysfs-bus-cxl | 13 +++++
> drivers/cxl/cxlmem.h | 43 ++++++++++++++
> include/trace/events/cxl.h | 60 ++++++++++++++++++++
> drivers/cxl/core/mbox.c | 75 +++++++++++++++++++++++++
> drivers/cxl/core/memdev.c | 32 +++++++++++
> 5 files changed, 223 insertions(+)
> create mode 100644 include/trace/events/cxl.h
>
>
> base-commit: 2263e9ed65887cc7c6e977f372596199d2c9f4af
> --
> 2.31.1
>