Re: x86/mce merge, integration hickup + crash, design thoughts

From: Tim Hockin
Date: Mon Jan 12 2009 - 17:03:10 EST


On Sat, Dec 27, 2008 at 7:50 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
>
> We really need to get rid of /dev/mcelog. It's a quirky binary logging
> facility not available on 32-bit on current kernels and it has a couple of
> limitations:

It's not quirky, it's structured. Until and unless there is a better,
structured interface, this needs to stay.

> - it squeezes all MCE errors from the whole system into a small,
> 32-entry ringbuffer.

Yes, 32 is small, but not really a big problem in practice. The MCE daemon
(http://mcedaemon.googlecode.com) goes a long way towards making this
a non-issue. Every distro should ship mced.
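
A minimal sketch of why a draining daemon makes the small ring a
non-issue. This is a toy model, not the kernel's code: the class name,
the drop-when-full policy, and the burst sizes are assumptions made for
illustration. Only the 32-entry size comes from the thread.

```python
from collections import deque

LOG_LEN = 32  # ring size quoted in the thread


class MceRing:
    """Toy fixed-size log: once full, new records are dropped and counted.

    The drop-when-full policy is an assumption made for this sketch.
    """

    def __init__(self, capacity=LOG_LEN):
        self.capacity = capacity
        self.records = deque()
        self.dropped = 0

    def log(self, record):
        """Store a record, or count it as dropped if the ring is full."""
        if len(self.records) >= self.capacity:
            self.dropped += 1
            return False
        self.records.append(record)
        return True

    def drain(self):
        """Hand every stored record to the reader and empty the ring."""
        out = list(self.records)
        self.records.clear()
        return out


# A burst of 40 errors with nobody reading: everything past 32 is lost.
ring = MceRing()
for i in range(40):
    ring.log({"seq": i})
lost_without_daemon = ring.dropped

# The same burst with a reader that drains after every 10 records.
ring = MceRing()
collected = []
for i in range(40):
    ring.log({"seq": i})
    if (i + 1) % 10 == 0:
        collected.extend(ring.drain())
lost_with_daemon = ring.dropped
```

In this model the loss depends only on whether the reader keeps up with
the burst rate, not on the ring size as such.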

> - it puts all the MCE logging info into an intermediary binary log
> record format: 'struct mce' - just for userspace to in essence
> printf out those entries with minimal post-processing. The fact that
> we squeeze all information into a fixed-size binary record makes it
> hard to extend and complicates the code needlessly.

This is not true; we do EXTENSIVE post-processing of MCEs. Yes, it is hard
to extend, but that's not the same as saying it is useless.

> - these design aspects are also quite harmful to usability: by
> default all MCEs are fatal currently (pre-Nehalem anyway), so
> /dev/mcelog will only be used if a user goes out on a limb to
> configure it and sets the tolerant flag.

Not true. The vast, vast, vast majority of MCEs are corrected, in our
experience, and I suspect we've got more experience with MCEs than just
about any other consumer.

> A far more useful design for handling MCE events would be to feed them
> into printk logging. So instead of printing such rather cryptic error
> messages:
>
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 0 BANK 6 MISC 202d ADDR ffeef740
> This is not a software problem!
> Run through mcelog --ascii to decode and contact your hardware vendor
>
> and expecting people to run mcelog, we should print plain-text something
> like:
>
> MCE 0
> HARDWARE ERROR. This is *NOT* a software problem!
> Please contact your hardware vendor
> CPU 1 4 northbridge TSC 89a560bb249
> ADDR 1dfa49690
> Northbridge Chipkill ECC error
> Chipkill ECC syndrome = 2021
> bit46 = corrected ecc error
> bus error 'local node response, request didn't time out
> generic read mem transaction
> memory access, level generic'
> STATUS 9410c00020080a13 MCGSTATUS 0

NO! This moves in the completely wrong direction. This output is just
about impossible to parse sanely. And don't get me started about how much
of a pain it is to encode all of the vendor-centric and per-platform
specifics so that it decodes all the useful bits.

I promise you that there are things that are useful to us that the kernel
will not parse and print.
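
To illustrate the parsing concern, here is a minimal sketch that parses
the raw line from the first quoted example. The function name is
hypothetical, and treating CPU and BANK as decimal and every other field
as hex is an assumption, not something the thread states.

```python
def parse_raw_mce_line(line):
    """Parse a flat 'KEY VALUE KEY VALUE ...' line into a dict of ints."""
    tokens = line.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating KEY VALUE tokens")
    decimal_keys = {"CPU", "BANK"}  # assumption: everything else is hex
    record = {}
    for key, value in zip(tokens[0::2], tokens[1::2]):
        base = 10 if key in decimal_keys else 16
        record[key] = int(value, base)  # raises ValueError on free text
    return record


# The raw line from the first quoted example parses cleanly.
raw = parse_raw_mce_line("CPU 0 BANK 6 MISC 202d ADDR ffeef740")

# A line from the decoded example does not fit any fixed grammar.
try:
    parse_raw_mce_line("CPU 1 4 northbridge TSC 89a560bb249")
    decoded_line_parsed = True
except ValueError:
    decoded_line_parsed = False
```

The raw form needs one rule. The decoded form needs a separate rule for
nearly every line, and those rules change whenever the wording does.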

> straight from the kernel. This means that the MCEs will make a lot more
> sense at a glance - and the user can figure out the suspected trouble
> area, without having to find some other box to run mcelog on, etc. We can
> eliminate the user-space mcelog utility/daemon component altogether - it
> buys us little but needless complexity and inflexibility.

I disagree totally. Putting the logic for decoding MCEs into the kernel
adds a large amount of complex, hard-to-maintain code precisely where it
is most destructive. Any sort of decoding of this data in-kernel is
probably a mistake (yes, I am looking at you, EDAC).

It would be much cleaner to instead make the kernel really good at
publishing errors and let userspace decode them. A naive end user won't
know the difference - they will see a message on the syslog or the
console. Power users will be able to do better things.
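
A minimal sketch of what userspace decoding of a raw value can look
like, using the STATUS value from the quoted example. Only the
architectural flag bits of the x86 MCi_STATUS register and the low 16
bits (the MCA error code) are decoded. The function names are
hypothetical, and the vendor-specific fields are deliberately left
untouched.

```python
# Architectural flag bits of an x86 MCi_STATUS register.
STATUS_FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # error was not corrected
    "EN": 60,     # reporting was enabled for this error
    "MISCV": 59,  # MISC register contents are valid
    "ADDRV": 58,  # ADDR register contents are valid
    "PCC": 57,    # processor context may be corrupt
}


def decode_status(status):
    """Split a raw 64-bit status value into flags and the MCA error code."""
    flags = {name: bool((status >> bit) & 1)
             for name, bit in STATUS_FLAGS.items()}
    flags["MCA_CODE"] = status & 0xFFFF  # low 16 bits
    return flags


def summarize(status):
    """Produce the one-line message a naive end user would see."""
    f = decode_status(status)
    if not f["VAL"]:
        return "no valid error"
    kind = "uncorrected" if f["UC"] else "corrected"
    return f"{kind} error, MCA code {f['MCA_CODE']:#06x}"


# STATUS value taken from the quoted example output.
decoded = decode_status(0x9410C00020080A13)
summary = summarize(0x9410C00020080A13)
```

The summary line serves the naive user, while the full raw value stays
available to anything that wants to decode more of it.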

> If we want to enable userspace to capture MCE events, then it must be done
> in a way that benefits the whole kernel, not just x86: a structured
> logging facility that is in essence a printk variant and is ASCII driven.

I would be fine with this (though I disagree that it *must* be ASCII). I
don't care how the raw data comes out of the kernel, just that it is the
raw data, not some half-assedly parsed data.

> Such event sources should be discoverable, and only 'aware' printouts
> should go into this new facility (not all printks). Demultiplexing should
> be easy and well-defined.

I agree 1000% with this - it is something I have wanted for a long time,
and that some of my team are starting to look at again.
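
A minimal sketch of demultiplexing on the consumer side. The line format
("source key=value ..."), the source names, and the Demux class are all
hypothetical; nothing here reflects an existing kernel interface.

```python
from collections import defaultdict


def parse_event(line):
    """Split 'source key=value key=value ...' into (source, fields)."""
    source, _, rest = line.partition(" ")
    fields = dict(item.split("=", 1) for item in rest.split())
    return source, fields


class Demux:
    """Route each event line to the handlers registered for its source."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, source, handler):
        self.handlers[source].append(handler)

    def dispatch(self, line):
        """Deliver one line; return False if nobody wants that source."""
        source, fields = parse_event(line)
        if source not in self.handlers:
            return False
        for handler in self.handlers[source]:
            handler(fields)
        return True


mce_events = []
demux = Demux()
demux.subscribe("mce", mce_events.append)

# Hypothetical event lines from two different sources.
handled_mce = demux.dispatch("mce cpu=0 bank=6 misc=202d addr=ffeef740")
handled_other = demux.dispatch("thermal cpu=1 event=throttle")
```

Because the source tag is the first token, a consumer can skip events it
does not care about without understanding their fields.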

> I.e. we could use this opportunity of the MCE code unification to bring
> the code to the next level - and not prolongue to broken concepts of the
> past.

Don't discard the past until the present can stand on its own. Please.

> I'd be glad to help out with any portion of this, it should be easy to
> solve and it will clearly improve the code. For .29 we could just do a raw
> printk based approach with no decoding just yet, and layer smart decoding
> and structured logging for .30.

Please leave /dev/mcelog as a CONFIG option until such time as an
appropriate replacement is *done*. We rely on it.

Tim