Re: Extended H/W error log driver

From: Borislav Petkov
Date: Tue Oct 15 2013 - 05:29:23 EST


On Tue, Oct 15, 2013 at 12:07:31AM -0400, Chen Gong wrote:
> Some errors have multiple sub sections like below:
>
> [ 1442.070522] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [ 1442.070528] {2}[Hardware Error]: event severity: corrected
> [ 1442.070531] {2}[Hardware Error]: sub_event[0], severity: corrected
> [ 1442.070534] {2}[Hardware Error]: section_type: memory error
> [ 1442.070537] {2}[Hardware Error]: error_status: 0x0000000000000000
> [ 1442.070539] {2}[Hardware Error]: sub_event[1], severity: corrected
> [ 1442.070541] {2}[Hardware Error]: section_type: memory error
> [ 1442.070543] {2}[Hardware Error]: error_status: 0x0000000000000000

Right, and what do those sub sections mean to the user? Did we have
multiple errors?

It looks like this because both sub sections have the memory error
section type, but that is not very telling. How about:


[ 1442.070522] {2}[Hardware Error]: APEI GHES id 0: Hardware errors logged
[ 1442.070528] {2}[Hardware Error]: event severity: corrected
[ 1442.070534] {2}[Hardware Error]: Error 0, type: corrected memory error.
[ 1442.070537] {2}[Hardware Error]: error_status: 0x0000000000000000
[ 1442.070539] {2}[Hardware Error]: Error 1, type: corrected memory error.
[ 1442.070543] {2}[Hardware Error]: error_status: 0x0000000000000000

I think this is much more human readable and understandable :-)

We can even add a hint for the user like:

"Above errors have been corrected by the hardware and require no further action."

Btw, this is valid for both dmesg and trace event output.

Because from my experience so far people just scream: "Look, I just had
an MCE" without even reading what it says. And this just upsets support
people for no valid reason at all.
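
To make the proposed layout concrete, here is a minimal userspace sketch
of how such output could be assembled. It is not the actual GHES/APEI
code: the names hw_sev, hw_err, emit() and hw_err_format() are
hypothetical, the timestamp and "{2}" sequence prefix are left out, and
the event severity is simply assumed to be the worst severity among the
individual errors.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical severity levels, ordered from least to most severe. */
enum hw_sev { HW_SEV_CORRECTED, HW_SEV_RECOVERABLE, HW_SEV_FATAL };

/* Hypothetical, simplified view of one reported error section. */
struct hw_err {
	enum hw_sev sev;
	const char *type;		/* e.g. "memory error" */
	unsigned long long status;	/* raw error_status value */
};

static const char *hw_sev_str(enum hw_sev sev)
{
	switch (sev) {
	case HW_SEV_CORRECTED:   return "corrected";
	case HW_SEV_RECOVERABLE: return "recoverable";
	default:                 return "fatal";
	}
}

/* Append formatted text at *off; output is truncated if buf is full. */
static void emit(char *buf, size_t len, size_t *off, const char *fmt, ...)
{
	va_list ap;
	int n;

	if (*off >= len)
		return;
	va_start(ap, fmt);
	n = vsnprintf(buf + *off, len - *off, fmt, ap);
	va_end(ap);
	if (n > 0)
		*off += (size_t)n;
}

/* Render all errors of one event in the proposed human-readable form. */
void hw_err_format(char *buf, size_t len, unsigned int ghes_id,
		   const struct hw_err *errs, size_t n)
{
	static const char pfx[] = "[Hardware Error]: ";
	enum hw_sev worst = HW_SEV_CORRECTED;
	size_t off = 0, i;

	if (len)
		buf[0] = '\0';

	/* Assumption: event severity is the worst per-error severity. */
	for (i = 0; i < n; i++)
		if (errs[i].sev > worst)
			worst = errs[i].sev;

	emit(buf, len, &off, "%sAPEI GHES id %u: Hardware errors logged\n",
	     pfx, ghes_id);
	emit(buf, len, &off, "%sevent severity: %s\n", pfx, hw_sev_str(worst));

	for (i = 0; i < n; i++) {
		emit(buf, len, &off, "%sError %zu, type: %s %s.\n",
		     pfx, i, hw_sev_str(errs[i].sev), errs[i].type);
		emit(buf, len, &off, "%serror_status: 0x%016llx\n",
		     pfx, errs[i].status);
	}

	/* Only reassure the user when every error was corrected. */
	if (n > 0 && worst == HW_SEV_CORRECTED)
		emit(buf, len, &off, "%sAbove errors have been corrected by "
		     "the hardware and require no further action.\n", pfx);
}
```

Called with two corrected memory errors, this produces one "Error N"
line per error followed by the hint; if any error is not corrected, the
hint is left out so that it never hides something the user has to act on.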

Thanks.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.