Re: [RFC/Requirements/Design] h/w error reporting

From: Ingo Molnar
Date: Wed Nov 10 2010 - 05:15:35 EST



* Luck, Tony <tony.luck@xxxxxxxxx> wrote:

> Taking a cue from the tracing session from the previous day (where the "perf" vs.
> "ftrace" vs. "lttng" war was ended by proposing a new tracing methodology that
> would overcome the shortcomings of both of the merged subsystems while also
> addressing the requirements of the lttng users) [...]

Well, the direction is that we are unifying ftrace and perf events and we are
actively phasing out individual ftrace plugins as matching events become available
(we already removed a few).

Most new tools use the perf syscall and tool writers have expressed the very
understandable desire that all events (and their reporting facility) be enumerated
and accessible via a unified API/ABI.

It often seems easier for subsystems to just do their own ad-hoc
logging/reporting in the short run: every subsystem tends to think it has its own
very specific requirements for logging, while users and tool authors can only shake
their heads in disbelief when looking at the myriad incompatible and inconsistent
facilities. But the tooling requirement for unification is strong here and cannot be
ignored.

> [...] we explored whether the solution would be to define a new "system health"
> subsystem that could be used by any part of the kernel to report hardware issues
> in a coherent way so that end users would have a single place to look for all
> error information.

Note that Boris has been working on extending perf events into this area as well,
see this recent submission of patches on lkml:

[PATCH 20/20] ras: Add RAS daemon

One thing is clear: any 'health subsystem' should not do its own flavor of error
reporting - instead we want to unify various forms of event logging into a common
facility.

RAS/EDAC could do its own hardware-specific settings via a separate subsystem,
although even many of those can be expressed via their respective events. (On the
perf events side we are open to providing callbacks/facilities for such use.)

The synergies of unified event reporting are very strong.

Thanks,

Ingo