Re: Hardware Error Kernel Mini-Summit

From: Tony Luck
Date: Mon Jun 14 2010 - 17:34:35 EST


On Mon, Jun 14, 2010 at 1:36 PM, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> On Mon, Jun 14, 2010 at 01:06:59PM -0700, Eric W. Biederman wrote:
>> Displaying the fact that ECC is turned on in the hardware is one
>> of the more interesting bits.  That at least allows you to verify
>> that things are working.
>
> There are hundreds to thousands of BIOS level hardware knobs for memory
> configuration (and if you count all BIOS knobs for everything far more)
>
> Why do you want to check a single bit only? (which is actually not
> a single bit but also a lot of different ways to set this)

There was a case mentioned at the collaboration summit
meeting where a BIOS bug misreported whether ECC was
enabled: it claimed ECC was on when in fact it was off.

Error injection could be used to catch another instance of a
lying BIOS (inject an error, then make sure it gets counted).
This is not as direct as checking that the right bits are set in
the memory controller configuration registers, but it is still
effective. Perhaps more so, since this technique validates
different pieces of the chipset-specific code against each other.
An EDAC driver that tells you ECC is enabled might be lying too,
if it is looking at the wrong bit or the wrong register.
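
As a rough sketch of the "inject, then make sure it gets counted"
check: the snippet below assumes the EDAC driver exposes corrected
error counters as /sys/devices/system/edac/mc/mcN/ce_count. The
injection step itself is platform specific and is deliberately left
out; the helper names read_ce_counts and injection_was_counted are
made up for illustration.

```python
import glob
import os


def read_ce_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller name: corrected error count} for every mcN found.

    Assumes the EDAC sysfs layout <edac_root>/mcN/ce_count; returns an
    empty dict if no EDAC driver is loaded.
    """
    counts = {}
    for path in sorted(glob.glob(os.path.join(edac_root, "mc*", "ce_count"))):
        mc = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            counts[mc] = int(f.read().strip())
    return counts


def injection_was_counted(before, after):
    """True if any controller's corrected error count went up."""
    return any(after.get(mc, 0) > n for mc, n in before.items())
```

Usage would be: take a snapshot with read_ce_counts(), inject a
correctable error by whatever means the platform provides, take a
second snapshot, and compare the two. If no counter moved, either
ECC is not really on or the driver is looking in the wrong place.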

-Tony