Re: [RFC PATCH 1/9] cxl/mem: Implement Get Event Records command

From: Ira Weiny
Date: Tue Sep 20 2022 - 18:10:40 EST


On Tue, Sep 20, 2022 at 01:23:29PM -0700, Jiang, Dave wrote:
>
> On 9/20/2022 8:49 AM, Jonathan Cameron wrote:
> > On Fri, 9 Sep 2022 13:53:55 -0700
> > Ira Weiny <ira.weiny@xxxxxxxxx> wrote:
> >
> > > On Thu, Sep 08, 2022 at 01:52:40PM +0100, Jonathan Cameron wrote:
> > > [snip]
> > >
> > > > > > > diff --git a/include/trace/events/cxl-events.h b/include/trace/events/cxl-events.h
> > > > > > > new file mode 100644
> > > > > > > index 000000000000..f4baeae66cf3
> > > > > > > --- /dev/null
> > > > > > > +++ b/include/trace/events/cxl-events.h
> > > > > > > @@ -0,0 +1,127 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#undef TRACE_SYSTEM
> > > > > > > +#define TRACE_SYSTEM cxl_events
> > > > > > > +
> > > > > > > +#if !defined(_CXL_TRACE_EVENTS_H) || defined(TRACE_HEADER_MULTI_READ)
> > > > > > > +#define _CXL_TRACE_EVENTS_H
> > > > > > > +
> > > > > > > +#include <linux/tracepoint.h>
> > > > > > > +
> > > > > > > +#define EVENT_LOGS \
> > > > > > > + EM(CXL_EVENT_TYPE_INFO, "Info") \
> > > > > > > + EM(CXL_EVENT_TYPE_WARN, "Warning") \
> > > > > > > + EM(CXL_EVENT_TYPE_FAIL, "Failure") \
> > > > > > > + EM(CXL_EVENT_TYPE_FATAL, "Fatal") \
> > > > > > > + EMe(CXL_EVENT_TYPE_MAX, "<undefined>")
> > > > > > Hmm. 4 is defined in CXL 3.0, but I'd assume we won't use tracepoints for
> > > > > > dynamic capacity events so I guess it doesn't matter.
> > > > > I'm not sure why you would say that. I anticipate some user space daemon
> > > > > requiring these events to set things up.
> > > > Certainly a possible solution. I'd kind of expect a more hand shake based approach
> > > > than a tracepoint. Guess we'll see :)
> > > Yea I think we should wait an see.
> > >
> > > > > > > + { CXL_EVENT_RECORD_FLAG_PERF_DEGRADED, "Performance Degraded" }, \
> > > > > > > + { CXL_EVENT_RECORD_FLAG_HW_REPLACE, "Hardware Replacement Needed" } \
> > > > > > > +)
> > > > > > > +
> > > > > > > +TRACE_EVENT(cxl_event,
> > > > > > > +
> > > > > > > + TP_PROTO(const char *dev_name, enum cxl_event_log_type log,
> > > > > > > + struct cxl_event_record_raw *rec),
> > > > > > > +
> > > > > > > + TP_ARGS(dev_name, log, rec),
> > > > > > > +
> > > > > > > + TP_STRUCT__entry(
> > > > > > > + __string(dev_name, dev_name)
> > > > > > > + __field(int, log)
> > > > > > > + __array(u8, id, UUID_SIZE)
> > > > > > > + __field(u32, flags)
> > > > > > > + __field(u16, handle)
> > > > > > > + __field(u16, related_handle)
> > > > > > > + __field(u64, timestamp)
> > > > > > > + __array(u8, data, EVENT_RECORD_DATA_LENGTH)
> > > > > > > + __field(u8, length)
> > > > > > Do we want the maintenance operation class added in Table 8-42 from CXL 3.0?
> > > > > > (only noticed because I happen to have that spec revision open rather than 2.0).
> > > > > Yes done.
> > > > >
> > > > > There is some discussion with Dan regarding not decoding anything and letting
> > > > > user space take care of it all. I think this shows a valid reason Dan
> > > > > suggested this.
> > > > I like being able to print tracepoints with out userspace tools.
> > > > This also enforces structure and stability of interface which I like.
> > > I tend to agree with you.
> > >
> > > > Maybe a raw tracepoint or variable length trailing buffer to pass
> > > > on what we don't understand?
> > > I've already realized that we need to print all reserved fields for this
> > > reason. If there is something the kernel does not understand user space can
> > > just figure it out on it's own.
> > >
> > > Sound reasonable?
> > Hmm. Printing reserved fields would be unusual. Not sure what is done for similar
> > cases elsewhere, CPER records etc...
> >
> > We could just print a raw array of the whole event as well as decode version, but
> > that means logging most of the fields twice...
> >
> > Not nice either.
> >
> > I'm a bit inclined to say we should maybe just ignore stuff we don't know about or
> > is there a version number we can use to decide between decoded vs decoded as much as
> > possible + raw log?

I'm not a fan of logging both the raw and the decoded versions.

>
> libtraceevent can pull the trace event data structure fields directly. So
> the raw data can be pulled directly from the kernel.

This raw data needs to be in a field though. If the kernel does not save the
reserved fields in the TP_fast_assign() then the data won't be in a field to
access.

>
> And what gets printed
> to the trace buffer can be decoded data constructed from those fields by the
> kernel code. So with that you can have access both.
>

Fast assigning the entire buffer + decoded versions will roughly double the
trace event size.

Thinking through this a bit more, there is a sticking point.

The difficulty will be ensuring that any new field names are documented such
that when user space starts to look at them they can determine if that data
appears as a new field or as part of a reserved field.

For example, if user space needs to access data in the reserved data now, it
can simply decode it. However, when that data becomes a field, it is no longer
part of the reserved data. So user space would need to look for the field
first (i.e. know the field name) and then, if it does not appear, extract it
from the reserved data.
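
To make that lookup order concrete, here is a rough user space sketch. It is
only an illustration: struct parsed_event, struct named_field, the helper
names and the "maint_op_class" field name are all made up for this example and
are not part of libtraceevent or of this patch. It also assumes the value is a
single byte at a known offset in the reserved data.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified view of one parsed trace event. */
struct named_field {
	const char *name;	/* field name as exported by the tracepoint */
	uint64_t value;
};

struct parsed_event {
	const struct named_field *fields;	/* decoded fields, may be NULL */
	size_t nr_fields;
	const uint8_t *reserved;		/* raw reserved bytes as logged */
	size_t reserved_len;
};

/* Return 0 and set *out if a field with this name exists, else -1. */
static int find_field(const struct parsed_event *ev, const char *name,
		      uint64_t *out)
{
	for (size_t i = 0; i < ev->nr_fields; i++) {
		if (strcmp(ev->fields[i].name, name) == 0) {
			*out = ev->fields[i].value;
			return 0;
		}
	}
	return -1;
}

/*
 * Prefer the named field (newer kernel). If it is absent (older kernel),
 * fall back to the byte at reserved_off in the reserved data.
 */
static int get_value(const struct parsed_event *ev, const char *name,
		     size_t reserved_off, uint64_t *out)
{
	if (find_field(ev, name, out) == 0)
		return 0;

	if (reserved_off < ev->reserved_len) {
		*out = ev->reserved[reserved_off];
		return 0;
	}
	return -1;
}
```

A caller would do something like get_value(&ev, "maint_op_class", 1, &val) and
not care which kernel produced the event. The cost is that every such field
needs both its name and its old reserved offset documented somewhere.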

I'm now wondering if I've wasted my time decoding anything, since the kernel
does not need to know anything about these fields, and the above scenario
means that user space may get ugly over time.

That said I don't think it will present any incompatibilities. So perhaps we
are ok?

Ira