Re: [RFC PATCH 0/8] EDAC, mce_amd: Add a tracepoint for the decoded error
From: Ingo Molnar
Date: Thu Jul 27 2017 - 04:39:44 EST
* Borislav Petkov <bp@xxxxxxxxx> wrote:
> On Thu, Jul 27, 2017 at 09:10:34AM +0200, Ingo Molnar wrote:
> > Looks pretty nice to me conceptually. Do you have a couple of examples of
> > real-life events that get logged? It's hard to decode it from the new tracepoint
> > alone.
>
> Here's what comes out in dmesg:
>
> [ 932.370319] mce: [Hardware Error]: Machine check events logged
> [ 932.374474] [Hardware Error]: Corrected error, no action required.
> [ 932.381684] [Hardware Error]: CPU:1 (0:0:0) MC5_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc00410000020f0f
> [ 932.384256] [Hardware Error]: Error Addr: 0x0000000056071033 [Hardware Error]: TSC: 2703436211649
> [ 932.386608] [Hardware Error]: MC5 Error: AG payload array parity error.
> [ 932.388425] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
>
> (whoops, that TSC thing should be on a new line).
>
> and the TP dumps only the last two lines:
>
> [ 932.386608] [Hardware Error]: MC5 Error: AG payload array parity error.
> [ 932.388425] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (timed out)
>
> but come to think of it, it should dump only the MC? Error line because
> the last line can be easily deduced from the error code. I'll change
> that.
I think the more human-readable the raw trace data is, the better - but it should
also have a structure that makes it straightforward to parse with tooling.
> Btw, the reason why I'm dumping only MC? line is to keep the string
> going into the TP relatively small. It is 128 bytes now. I tried dumping
> the whole decoded string but that easily overflowed 256 bytes and 256
> bytes is already a bit too much to log into the trace buffers.
I don't think so: we routinely log several KB worth of call chains per event via
perf tracing just fine.
So I'd suggest logging more rather than less - making it more verbose is definitely
the way to go.
For example, I absolutely hate systemd's obscure error messages and its generally
passive-aggressive behavior towards failure analysis and troubleshooting.
Just a random example of systemd (version 232) troubleshooting UI fail:
root@triton:~# systemctl start rsyslog; echo $?
0
root@triton:~# systemctl start rsyslog; echo $?
0
Was the rsyslog service started successfully? Yes/no?
And if I try to stop it:
root@triton:~# systemctl stop rsyslog; echo $?
Warning: Stopping rsyslog.service, but it can still be activated by:
syslog.socket
0
... the text suggests that it got stopped, and the command returns 0 which usually
means success. It warns that syslog.socket could possibly re-activate the service,
but it does not tell whether the service actually _is_ active after this command,
or got stopped. I mean, systemd should be very much aware of what the real
situation is, and could volunteer all that helpful info - instead it's being
passively obtuse.
Turns out the text is misleading and the service did not get stopped:
root@triton:~# ps aux | grep -i rsys
syslog 27311 0.0 0.0 256416 3256 ? Ssl 10:32 0:00 /usr/sbin/rsyslogd -n
Or another UI idiocy:
root@triton:~# systemctl -v
systemctl: invalid option -- 'v'
root@triton:~# systemctl --version
systemd 232
+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD +IDN
i.e. the systemd developers implemented --version, but they did not do the trivial
mapping of it to the customary (and unused) -v alias...
etc. etc.
So let's not repeat these kinds of mistakes... be as informative and helpful to
the user as programmatically possible.
Thanks,
Ingo