Re: RAS trace event proto

From: Borislav Petkov
Date: Wed Feb 22 2012 - 05:43:49 EST


On Wed, Feb 22, 2012 at 12:58:37AM +0000, Luck, Tony wrote:
> I'm also struggling to understand an end-user use case where you would
> want filtering. Mauro - can you expand a bit on why someone would just
> want to see the errors from memory controller 1?
>
> My mental model of the world is that large systems have some background
> noise - a trickle of corrected errors that happen in normal operation.
> User shouldn't care about these errors unless they breach some threshold.
>
> When something goes wrong, you may see a storm of corrected errors, or
> some uncorrected errors. In either case you'd like to get as much information
> as possible to identify the component that is at fault. I'd definitely like
> to see some structure to the error reporting, so that mining for data patterns
> in a storm isn't hideously platform dependent.

Yep, I'm on the same page here.

> It might be easier to evaluate the competing ideas here with some sample
> output in addition to the code.

Well, to clarify:

When you get a decoded error, you get the same format as what you get in
dmesg, for example:

[ 2666.646070] [Hardware Error]: CPU:64 MC1_STATUS[-|CE|MiscV|PCC|-|CECC]: 0x9a05c00007010011
[ 2666.655003] [Hardware Error]: Instruction Cache Error: L1 TLB multimatch.
[ 2666.655008] [Hardware Error]: cache level: L1, tx: INSN

With the decoded string tracepoint, all of the above arrives as a
single string. If you use trace_mce_record(), you can still get the
individual MCE fields which we carry to userspace from struct mce, _in
addition_. The hypothetical problem is that userspace cannot use the
tracepoint format to parse the reported fields easily and
unambiguously. Instead, it gets a single string which, I admit, is not
that pretty.
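
To illustrate the difference for a userspace consumer, here is a minimal
sketch, not taken from any existing tool. It assumes the string looks
exactly like the first sample line above; the regular expression and the
names parse_decoded_line() and status_bits() are made up for this
example. The bit positions are the architectural MCi_STATUS ones (VAL,
OVER, UC, EN, MISCV, ADDRV, PCC in bits 63..57).

```python
import re

# Sample line copied from the dmesg output above.
LINE = ("[ 2666.646070] [Hardware Error]: CPU:64 "
        "MC1_STATUS[-|CE|MiscV|PCC|-|CECC]: 0x9a05c00007010011")

# Hypothetical pattern: it only matches the STATUS line format shown above
# and breaks as soon as the wording of the string changes.
PATTERN = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[Hardware Error\]:\s+"
    r"CPU:(?P<cpu>\d+)\s+MC(?P<bank>\d+)_STATUS\[(?P<flags>[^\]]*)\]:\s+"
    r"(?P<status>0x[0-9a-fA-F]+)"
)


def parse_decoded_line(line):
    """Scrape fields out of the decoded string; return None if it does not match."""
    m = PATTERN.match(line)
    if m is None:
        return None
    return {
        "timestamp": float(m.group("ts")),
        "cpu": int(m.group("cpu")),
        "bank": int(m.group("bank")),
        "flags": m.group("flags").split("|"),
        "status": int(m.group("status"), 16),
    }


def status_bits(status):
    """Test the architectural MCi_STATUS flag bits on a raw status value."""
    names = ["VAL", "OVER", "UC", "EN", "MISCV", "ADDRV", "PCC"]
    # VAL is bit 63, OVER bit 62, ... PCC bit 57.
    return {name: bool((status >> (63 - i)) & 1) for i, name in enumerate(names)}


if __name__ == "__main__":
    rec = parse_decoded_line(LINE)
    print(rec)
    print(status_bits(rec["status"]))
```

With separate fields, only the status_bits() step is needed; the regex
part is the fragile bit, since it has to track every change in the
string's wording.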

Now, the problem with using a single tracepoint for all errors is that
hardly any fields can be shared between them, except maybe the TSC
stamp of when the error happened, the CPU that caught it and other
not-so-important details.

IOW, the error format is different for almost every error type, and
there's no marrying between them. OTOH, if we start adding tracepoints
for each error type, we'll hit the other end - bloat. So also a no-no.

Maybe the compromise would be to define a single tracepoint per
_hardware_ error reporting scheme. That is, MCA has its own tracepoint,
PCIE AER has its own error reporting tracepoint, then there's an EDAC
!x86 one which doesn't use MCA for reporting and also any other scheme a
hw vendor would come up with...

This will keep the bloat level to a minimum, keep the TPs apart and
hopefully make all of us happy :).
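
As a rough userspace-side model of that compromise, here is a minimal
sketch. It assumes one record type per reporting scheme, with only the
TSC stamp and the CPU shared between them. All class and field names
(CommonHeader, McaEvent, AerEvent, describe()) are invented for
illustration and are not the kernel's tracepoint definitions.

```python
from dataclasses import dataclass


@dataclass
class CommonHeader:
    """The few details every scheme could share."""
    tsc: int  # TSC stamp of when the error happened
    cpu: int  # CPU that caught the error


@dataclass
class McaEvent:
    """Hypothetical record for the MCA scheme."""
    hdr: CommonHeader
    bank: int
    status: int
    addr: int
    misc: int


@dataclass
class AerEvent:
    """Hypothetical record for the PCIe AER scheme."""
    hdr: CommonHeader
    dev_name: str
    severity: str
    status: int


def describe(event):
    """Format an event using the field layout of its own scheme."""
    prefix = f"cpu={event.hdr.cpu} tsc={event.hdr.tsc}"
    if isinstance(event, McaEvent):
        return f"{prefix} scheme=mca bank={event.bank} status={event.status:#018x}"
    if isinstance(event, AerEvent):
        return (f"{prefix} scheme=aer dev={event.dev_name} "
                f"severity={event.severity} status={event.status:#010x}")
    raise TypeError(f"unknown error reporting scheme: {type(event).__name__}")


if __name__ == "__main__":
    mca = McaEvent(CommonHeader(tsc=123, cpu=64), bank=1,
                   status=0x9a05c00007010011, addr=0, misc=0)
    print(describe(mca))
```

The consumer needs one branch per scheme instead of one per error type,
and each branch sees typed fields rather than a string.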

Opinions?


--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551