Re: Hardware Error Kernel Mini-Summit

From: Nils Carlson
Date: Tue Jun 15 2010 - 08:22:59 EST


Hi Andi,

On Tue, 15 Jun 2010, Andi Kleen wrote:

> On Tue, Jun 15, 2010 at 10:06:33AM +0200, Nils Carlson wrote:
>
> Hi Nils,
>
> > Could you maybe provide some references on how DIMM layout
> > could be read from DMI? I can't find anything nearly this specific,
> > or is it something we're expecting to happen in future BIOS's?
>
> The hardware (or BIOS) tells you the DIMM. You read the DIMMs
> from DMI and map them using the locators. The locator strings
> are not standardized, but there are not too many different
> formats around, so they can be implemented.
>
> Again this does not give you full layout, but it gives
> you a "path to a DIMM" and a DIMM locator.

Hmm.. From having a quick look at our boards I can conclude
that the information our BIOS puts in there is useless.
I will discuss this further with our BIOS writers. They do
their own error detection during boot, in which they
decode to DIMMs, so obviously the information is in there
(somewhere).
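
For reference, the "read the DIMMs from DMI and map them using the
locators" step could look roughly like the sketch below. It assumes
the text output of `dmidecode -t 17` (SMBIOS type 17, Memory Device)
as input; the function name parse_memory_devices is made up for
illustration, and what a given BIOS actually puts in the locator
fields is exactly the open question here.

```python
import re

# Start of a DMI record in dmidecode text output, e.g.
# "Handle 0x0040, DMI type 17, 40 bytes"
HANDLE_RE = re.compile(r"^Handle (0x[0-9A-Fa-f]+), DMI type (\d+),")

# Fields of a Memory Device record that are kept in this sketch.
WANTED = ("Locator", "Bank Locator", "Size")


def parse_memory_devices(dmidecode_text):
    """Return one dict per SMBIOS type 17 (Memory Device) record.

    Hypothetical helper: it only understands the plain-text layout
    printed by dmidecode and ignores every other record type.
    """
    devices = []
    current = None
    for line in dmidecode_text.splitlines():
        m = HANDLE_RE.match(line)
        if m:
            # A new record starts; only track it if it is type 17.
            current = None
            if int(m.group(2)) == 17:
                current = {"handle": int(m.group(1), 16)}
                devices.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.strip().partition(":")
        if key in WANTED:
            current[key] = value.strip()
    return devices
```

Feeding it the saved output of `dmidecode -t 17` gives a list of
handle/locator pairs; turning those free-form strings into a board
position is the part that depends on the BIOS vendor's format.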

> An alternative is also to use the ACPI based reporting
> mechanism which is needed on some systems. In this case
> the CPER gives you a reference to the DMI object of the DIMM.
>
> In principle DMI has more information (arrays, ranges etc.)
> but in my experience that is not strong enough to really find
> the DIMM on modern systems. You need hardware or BIOS help for this.

So what are we left with? Non-standardised locator strings
that may or may not be present, at the mercy of the BIOS writer?
I'm already feeling depressed. Rewriting EDAC to try to
make sense of this information seems overly risky.

I think in general that this is one of the wonderful things
about Linux: you're not so much at the mercy of BIOS writers.
As soon as we start relying on the BIOS for functionality we're
encouraging the BIOS people to put more functionality in there,
and BIOS functionality is great, as long as there are no bugs!

But there are bugs. And correcting them is so prohibitively
expensive that I don't even want to think about it. And when
the BIOS messes up, it's the device driver writers who have to
magically work around the problems.

Could we come up with some plan that doesn't involve
trusting the goodwill (and competence) of BIOS writers?

I personally really like the device tree compiler for PowerPC.
It allows you to be explicit about what you have. Not for everyone,
but maybe there could be some way to apply the same principle? Maybe
some way of loading modules with parameters or configuring your setup
from sysfs?
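
To make the "be explicit about what you have" idea a bit more
concrete, here is a minimal sketch. Everything in it is hypothetical:
the one-line-per-DIMM text format ("mc csrow channel label"), and the
names parse_layout and dimm_label. It only shows the principle of a
user-supplied board description replacing whatever the BIOS reports;
it is not a proposal for an actual interface.

```python
def parse_layout(text):
    """Parse a hand-written board description into a lookup table.

    Hypothetical format, one DIMM per line:
        <mc> <csrow> <channel> <label>
    '#' starts a comment; blank lines are ignored.
    """
    layout = {}
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split(None, 3)
        if len(fields) != 4:
            raise ValueError(
                f"line {lineno}: expected 'mc csrow channel label'"
            )
        mc, csrow, channel = (int(f) for f in fields[:3])
        layout[(mc, csrow, channel)] = fields[3]
    return layout


def dimm_label(layout, mc, csrow, channel):
    """Return the user-supplied label, or a fallback naming the slot."""
    return layout.get(
        (mc, csrow, channel),
        f"unknown (mc{mc} csrow{csrow} ch{channel})",
    )
```

The description would be written once per board type by someone who
knows the silkscreen, and a decoded error would then be looked up as
dimm_label(layout, mc, csrow, channel).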

/Nils