Re: [RFC EDAC/GHES] edac: lock module owner to avoid error report conflicts

From: Borislav Petkov
Date: Thu Nov 01 2012 - 18:02:42 EST


On Thu, Nov 01, 2012 at 09:09:07PM +0000, Luck, Tony wrote:
> > That is correct, unfortunately. That information is not available to
> > software in all cases. Maybe APEI could be used for that DIMM location
> > mapping through simple tables instead of letting it fumble the error
> > handling path.
>
> Not much hope for "simple"[1] tables. There is also a timing issue on
> systems with rank sparing, memory mirroring etc. ... you need to decode
> to the DIMM at the time the error happened. If you wait until later, then
> the system may have switched over to the spare rank or mirror ... and then
> your decode will point at the new target, rather than the old.

Yeah, normally we decode the error right after it has been logged, so...
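
To illustrate the timing point, here is a minimal sketch, not taken from
any real driver. The names rank_map, err_record, log_error() and
fail_over() are all hypothetical, and the "controller state" is just a
logical-to-physical rank table. It only shows that resolving the rank at
log time keeps pointing at the old target after a failover, while
resolving it later points at the spare.

```c
#include <assert.h>

#define NUM_RANKS 4

/* Hypothetical controller state: logical rank -> physical rank. */
struct rank_map {
	unsigned int phys[NUM_RANKS];
};

/* Hypothetical error record; phys_rank is resolved when the error is logged. */
struct err_record {
	unsigned int logical_rank;
	unsigned int phys_rank;
};

/* Look up the physical rank currently backing a logical rank. */
static unsigned int resolve(const struct rank_map *m, unsigned int logical)
{
	assert(logical < NUM_RANKS);
	return m->phys[logical];
}

/* Decode immediately: snapshot the physical rank at the time of the error. */
static struct err_record log_error(const struct rank_map *m, unsigned int logical)
{
	struct err_record rec;

	rec.logical_rank = logical;
	rec.phys_rank = resolve(m, logical);
	return rec;
}

/* Sparing kicks in: the logical rank is now backed by the spare. */
static void fail_over(struct rank_map *m, unsigned int logical, unsigned int spare)
{
	assert(logical < NUM_RANKS);
	m->phys[logical] = spare;
}
```

With this, a record taken before fail_over() still names the failing
rank, whereas calling resolve() afterwards returns the spare.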

> [1] Consider a 4 cpu-socket machine with 4 channels per socket and three
> DIMMs per channel - so there are 48 sockets on the motherboard. Then

You mean 48 DIMM slots, right?

> some lab monkey takes a box of random 1, 2, 4, 8 GB DIMMs and fills
> most of the sockets. BIOS will somehow make sense out of this and
> interleave where it finds matching speeds across pairs/quads of
> channels (though size need not match ... if you have a 2G and 4G DIMM
> you may get interleaving for the matching part, then non-interleaved for the
> "extra" 2G).

Right, but at least in the csrow case, we can still compute back the
csrow even with the interleaving, once we know exactly how it is done
(on which address bits, etc.). I think this should be doable on Intel
controllers too, but I don't know.
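
As a rough sketch of what "computing back" means, assuming a made-up
layout and not any real controller: two channels, interleaved on a
single address bit for the part where the sizes match, with the "extra"
capacity of the larger channel mapped linearly above that. The struct
names, decode() and the choice of interleave bit are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define GiB (1ULL << 30)

/* Hypothetical two-channel layout. */
struct layout {
	uint64_t size[2];	/* bytes populated on channel 0 and channel 1 */
	unsigned int ilv_bit;	/* address bit selecting the channel */
};

struct decoded {
	unsigned int channel;	/* 0 or 1 */
	uint64_t chan_addr;	/* offset within that channel */
};

static struct decoded decode(const struct layout *l, uint64_t sys_addr)
{
	uint64_t small = l->size[0] < l->size[1] ? l->size[0] : l->size[1];
	unsigned int big_chan = l->size[1] > l->size[0];
	uint64_t ilv_limit = 2 * small;	/* end of the interleaved region */
	struct decoded d;

	assert(sys_addr < l->size[0] + l->size[1]);

	if (sys_addr < ilv_limit) {
		uint64_t low_mask = (1ULL << l->ilv_bit) - 1;

		/* The interleave bit picks the channel ... */
		d.channel = (sys_addr >> l->ilv_bit) & 1;
		/* ... and is squeezed out to get the channel-local address. */
		d.chan_addr = ((sys_addr >> (l->ilv_bit + 1)) << l->ilv_bit) |
			      (sys_addr & low_mask);
	} else {
		/* Non-interleaved remainder lives on the larger channel only. */
		d.channel = big_chan;
		d.chan_addr = small + (sys_addr - ilv_limit);
	}
	return d;
}
```

For the 2G + 4G case from the footnote, the first 4G of address space
alternates between the channels and the last 2G decodes to the upper
half of the 4G channel. Going from the channel-local address to a csrow
would be a further lookup that is not shown here.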

--
Regards/Gruss,
Boris.