Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

From: Borislav Petkov
Date: Fri May 11 2012 - 06:25:38 EST


On Thu, May 10, 2012 at 10:48:40PM -0300, Mauro Carvalho Chehab wrote:
> On 10-05-2012 19:37, Luck, Tony wrote:
> > kworker/u:6-201 [007] .N.. 186.197280: mc_error: [Hardware Error]: mem_ctl#0: Corrected error memory read error on memory stick "DIMM_A1" (channel:0 slot:1 page:0x2f1eb3 offset:0x446 grain:32 syndrome:0x0 1 error(s): Unknown: Err=0001:0090 socket=0 channel=0/mask=1 rank=5)
> >
> > The word "error" appears *five* times on this line (once with a capital E).
> > I feel beaten, bruised and ready to give up on this machine with just one
> > actual error reported :-)
>
> :)
>
> Several of them come from the driver-provided details.
>
> The edac-mc core contributes "mc_error", "[Hardware Error]" and "Corrected error".
> The sb-edac driver contributes "memory read error" and "1 error(s)".
>
> We can easily get rid of "[Hardware Error]" by removing HW_ERR from:
>
> TP_printk(HW_ERR "mem_ctl#%d: %s error %s on memory stick \"%s\" (%s %s %s)",
>
> Replacing mc_error with something else is not hard, but this is the name of the trace call:
>
> TRACE_EVENT(mc_error,
> ...
>
> Maybe it is better to do s/mc_error/mc_event/g.

HW_ERR is the "official" prefix used by the MCE code in the kernel.
Maybe we can shorten it but it is needed to raise attention when staring
at dmesg output.

Now, since this tracepoint is not dmesg, we don't need it there at all
since we know that trace_mc_error reports memory errors.

"mc_error" is also not needed.

> The error count msg ("1 error(s)") could be replaced by "count:1".

Is there even a possibility to report more than one error when invoking
trace_mc_error once? If not, simply drop the count completely.

> > We could get rid of one by:
> > s/Corrected error memory read error/Corrected memory read error/
>
> This is the hardest possible solution ;) Changing it will cause weird messages
> all over EDAC drivers ;)

I agree with Tony here: repeating "error" a gazillion times in a single
report is a "naaah!"

Here's how it should look:

kworker/u:6-201 [007] .N.. 161.136624: [Hardware Error]: memory read on memory stick "DIMM_A1" (type: corrected socket:0 mc:0 channel:0 slot:0 rank:1 page:0x586b6e offset:0xa66 grain:32 syndrome:0x0 channel_mask:1)

* count is gone
* MC drivers shouldn't say "error" when reporting an error
* UE/CE moves into the brackets
* socket moves earlier in the brackets, to keep the whole thing hierarchical
* drop "err_code": what is that?
* drop the second "socket"
* drop "area": it only ever says "DRAM", are there any others?
* what is "channel_mask"?
* move "rank" earlier

Now this is an output format I can get on board with.
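
For illustration only, here is a rough userspace sketch of that layout in
plain C. It is not the TRACE_EVENT definition: struct mc_event, its field
names and format_mc_event() are all made up for this example. I also assume
here that the HW_ERR prefix and "channel_mask" are left out, since both are
still open questions above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical record; the field names are illustrative, not EDAC's. */
struct mc_event {
	const char *op;		/* what was attempted, e.g. "memory read" */
	const char *label;	/* DIMM label, e.g. "DIMM_A1" */
	bool corrected;		/* true: CE, false: UE */
	int socket, mc, channel, slot, rank;	/* location, outermost first */
	unsigned long page, offset;
	unsigned int grain;
	unsigned long syndrome;
};

/*
 * Format one event into buf. Returns snprintf()'s result, i.e. the length
 * the full line needs, so callers can detect truncation.
 */
int format_mc_event(char *buf, size_t len, const struct mc_event *e)
{
	assert(buf != NULL && e != NULL);

	return snprintf(buf, len,
		"%s on memory stick \"%s\" (type: %s socket:%d mc:%d "
		"channel:%d slot:%d rank:%d page:0x%lx offset:0x%lx "
		"grain:%u syndrome:0x%lx)",
		e->op, e->label,
		e->corrected ? "corrected" : "uncorrected",
		e->socket, e->mc, e->channel, e->slot, e->rank,
		e->page, e->offset, e->grain, e->syndrome);
}
```

The location fields go from the outermost unit (socket) down to the
innermost (rank), and the word "error" does not appear anywhere in the
line.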

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551