Re: [BUGFIX -v7] x86, MCE: Fix bugs and issues of MCE log ringbuffer

From: Ingo Molnar
Date: Fri Sep 18 2009 - 07:10:29 EST



* Huang Ying <ying.huang@xxxxxxxxx> wrote:

> Current MCE log ring buffer has following bugs and issues:
>
> - On larger systems the 32-entry buffer easily overflows, losing events.
>
> - We had some reports of events getting corrupted which were also
> blamed on the ring buffer.
>
> - There's a known livelock, now hit by more people, under high error
> rate.
>
> We fix these bugs and issues by making the MCE log ring buffer a
> lock-less per-CPU ring buffer.

I like the direction of this (the current MCE ring-buffer code is a bad
local hack that should never have been merged upstream in that form) -
but i'd like to see a MUCH more ambitious (and much more useful!)
approach instead of using an explicit ring-buffer.

Please define MCE generic tracepoints using TRACE_EVENT() and use
perfcounters to access them.
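
A minimal user-space sketch of the "access them via perfcounters" half,
written against today's perf_event_open() syscall (the successor of the
perfcounters interface). It assumes the kernel already exports an MCE
tracepoint; the tracefs path handed to read_tracepoint_id() and the
helper names are illustrative assumptions, not part of any patch in this
thread.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read a tracepoint's numeric id from its tracefs "id" file, e.g.
 * <tracefs>/events/<system>/<event>/id (the exact path is an assumption
 * supplied by the caller). Returns the id, or -1 if it cannot be read.
 */
long read_tracepoint_id(const char *id_path)
{
	FILE *f = fopen(id_path, "r");
	long id = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &id) != 1)
		id = -1;
	fclose(f);
	return id;
}

/* Describe the tracepoint as a perf event that records every hit. */
void init_tracepoint_attr(struct perf_event_attr *attr, long id)
{
	memset(attr, 0, sizeof(*attr));
	attr->type = PERF_TYPE_TRACEPOINT;
	attr->size = sizeof(*attr);
	attr->config = (__u64)id;
	attr->sample_period = 1;	/* no sampling: keep each event */
	attr->sample_type = PERF_SAMPLE_TIME | PERF_SAMPLE_CPU |
			    PERF_SAMPLE_TID | PERF_SAMPLE_RAW;
	attr->disabled = 1;		/* enabled later via ioctl() */
}

/*
 * Open the event for all tasks on one CPU (pid == -1, cpu >= 0).
 * Returns a file descriptor, or -1 with errno set.
 */
int open_tracepoint_on_cpu(long id, int cpu)
{
	struct perf_event_attr attr;

	init_tracepoint_attr(&attr, id);
	return (int)syscall(SYS_perf_event_open, &attr, -1, cpu, -1, 0UL);
}
```

Several agents can each open their own descriptor on the same
tracepoint, which is what allows a daemon and a system-wide tracer to
coexist. Opening a system-wide event normally needs elevated privileges.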

This approach solves all the problems you listed and it also adds a
large number of new features to MCE events:

- Multiple user-space agents can access MCE events. You can have an
mcelog daemon running but also a system-wide tracer capturing
important events in flight-recorder mode.

- Sampling support: the kernel and the user-space call-chain of MCE
events can be stored and analyzed as well. This way actual patterns
of bad behavior can be matched to precisely what kind of activity
happened in the kernel (and/or in the app) around that moment in
time.

- Coupling with other hardware and software events: the PMU can track a
number of other anomalies - monitoring software might choose to
monitor those plus the MCE events as well - in one coherent stream of
events.

- Discovery of MCE sources - tracepoints are enumerated and tools can
act upon the existence (or non-existence) of various channels of MCE
information.

- Filtering support: you just subscribe to and act upon the events you
are interested in. Then, even on a per-event-source basis, there are
in-kernel filter expressions available that can restrict the amount
of data that hits the event channel.

- Arbitrarily deep per-CPU buffering of events - you can buffer 32
entries or you can buffer as much as you want, as long as you have
the RAM.

- An NMI-safe ring-buffer implementation - mappable to user-space.

- Built-in support for timestamping of events, PID markers, CPU
markers, etc.

- A rich ABI accessible over a system call interface. Per-CPU, per-task
and per-workload monitoring of MCE events can be done this way. The
ABI itself has a nice, meaningful structure.

- Extensible ABI: new fields can be added without breaking tooling.
New tracepoints can be added as the hardware side evolves. There are
various parsers that can be used.

- Lots of scheduling/buffering/batching modes of operation for MCE
events. poll() support. mmap() support. read() support. You name it.

- Rich tooling support: even without any MCE specific extensions added
the 'perf' tool today offers various views of MCE data: perf report,
perf stat, perf trace can all be used to view logged MCE events and
perhaps correlate them to certain user-space usage patterns. But it
can be used directly as well, for user-space agents and policy action
in mcelog, etc.

- Significant code reduction and cleanup in the MCE code: the whole
mcelog facility can be dropped in essence.

- (these are the top of the list - there are more advantages as well.)
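
The filtering, buffering and poll()/mmap() items above can be sketched
from user-space as follows. This is an illustration under assumptions,
not a reference consumer: fd is taken to be a tracepoint event
descriptor obtained from perf_event_open(), the helper names are made
up, and the filter string passed in must name fields that the
tracepoint actually defines.

```c
#include <linux/perf_event.h>
#include <poll.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Size of the mapping: one metadata page plus data_pages data pages.
 * data_pages must be a power of two; it sets how deep the buffer is.
 */
size_t ring_bytes(unsigned int data_pages)
{
	return (size_t)(1 + data_pages) * (size_t)sysconf(_SC_PAGESIZE);
}

/*
 * Map the event's ring buffer into this process. The first page is the
 * control page (data_head/data_tail); records follow it.
 * Returns NULL on failure.
 */
struct perf_event_mmap_page *map_ring(int fd, unsigned int data_pages)
{
	void *p = mmap(NULL, ring_bytes(data_pages),
		       PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Attach an in-kernel filter expression, then start the event.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int set_filter_and_enable(int fd, const char *filter)
{
	if (ioctl(fd, PERF_EVENT_IOC_SET_FILTER, filter) < 0)
		return -1;
	return ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
}

/*
 * Block until the kernel signals new records or the timeout expires.
 * Returns 1 if data is ready, 0 on timeout, -1 on error or hangup.
 */
int wait_for_events(int fd, int timeout_ms)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };
	int n = poll(&pfd, 1, timeout_ms);

	if (n <= 0)
		return n;
	return (pfd.revents & POLLIN) ? 1 : -1;
}
```

A consumer would call map_ring() first, since poll() only reports data
for an event that has a mapped buffer, then set_filter_and_enable() with
an expression such as "bank == 4" (the field name here is hypothetical),
and finally loop on wait_for_events(), reading the records between
data_tail and data_head in the mapped pages.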

Such a design would basically propel the MCE code into the twenty-first
century. Once we have these facilities we can phase out /dev/mcelog for
good. It would turn Linux MCE events from a quirky hack that doesn't even
work after years of hacking into a modern, extensible event logging
facility that uses event sources and flexible transports to user-space.

It would actually be code that is not a problem child like today but one
that we can take pride in and which is fun to work on :-)

Now, an approach like this shouldn't just be a blind export of mce_log()
into a single artificial generic event [which is a pretty poor API to
begin with] - it should be the definition of meaningful
tracepoints/events that describe the hardware's structure.

I'd rather have a good enumeration of various sources of MCEs as
separate tracepoints than some badly jumbled mess of all MCE sources in
one inflexible ABI as /dev/mcelog does it today.

Note, if you need any perfcounter infrastructure extensions/help for
this then we'll be glad to provide that. I'm sure there's a few things
to enhance and a few things to fix - there always are with any
non-trivial new user :-) But heck would i take _those_ forward looking
problems over any of the current MCE design mess, any day of the week.

Thanks,

Ingo