Re: x86/mce merge, integration hickup + crash, design thoughts

From: Andi Kleen
Date: Tue Jan 13 2009 - 00:02:21 EST


Tim Hockin wrote:


- it squeezes all MCE errors from the whole system into a small,
32-entry ringbuffer.

Yes 32 is small, but not really a big problem in practice. The MCE daemon
(http://mcedaemon.googlecode.com)

Interesting.

goes a long way towards making this
a non-issue. Every distro should ship mced.

The latest mcelog version (not yet released) also supports a daemon
mode btw. But the limited buffer has been also fixed already
here and replaced with more flexible per CPU buffers.

- it puts all the MCE logging info into an intermediary binary log
record format: 'struct mce' - just for userspace to in essence
printf out those entries with minimal post-processing. The fact that
we squeeze all information into a fixed-size binary record makes it
hard to extend and complicates the code needlessly.

This is not true, we do EXTENSIVE post-processing of MCEs. Yes it is hard
to extend, but that's not the same as saying it is useless.

It's not hard to extend at all (I did it several times)
Adding fields is extremly simple and fully forwards and
backwards compatible. That works because mcelog asks
the kernel about the record size and just ignores any
excess data.

The only thing that cannot be done is removing old fields, but
that is difficult with ASCII parsers too (text parsers tend
to bail out when they can't find some field they expect)

They can be obsoleted fine however (happened with cpu -> extcpu)


- these design aspects are also quite harmful to usability: by
default all MCEs are fatal currently (pre-Nehalem anyway), so
/dev/mcelog will only be used if a user goes out on a limb to
configure it and sets the tolerant flag.

Not true. The vast vast vast majority of MCEs are corrected, as per our
experience, and I suspect we've got more experience with MCEs than just
about any other consumer.

My employer has a lot of experience with MCEs too... And it matches your
experience.

-Andi
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/