Re: [patch] SMP alternatives

From: Andi Kleen
Date: Thu Nov 24 2005 - 10:37:17 EST


On Thu, Nov 24, 2005 at 08:09:07AM -0700, Eric W. Biederman wrote:
> Andi Kleen <ak@xxxxxxx> writes:
>
> > On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> >> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> >> > What do you need a special driver for if the northbridge just
> >> > can do the scrubbing by itself?
> >>
> >> You need a driver to collect and report all the ECC single bit errors to
> >> the user so that they can decide if they have problem hardware.
> >
> > Assuming the errors are logged to the standard machine check
> > architecture that's already done by mce.c. K8 does that definitely.
> >
> > Take a look at mcelog at some point.
> > Your distro probably already sets it up by default to log to
> > /var/log/mcelog
> >
> >>
> >> EDAC is more than one thing
> >> - Control response to a fatal error
> >> - Report non-fatal events for analysis/user decision making
> >
> > x86-64 mce.c does all that There was even a port to i386 around at some point.
>
> Right on the k8 memory controller there is a lot of overlap,
> with what has already been implemented. For all other x86 memory
> controllers the code is filling a large void.

At least the Lindenhurst (E7205) datasheet says the chipset can trigger
MCEs in the CPU (using a MCEERR# pin). I don't know if it's always
enabled, but the hardware seems to have the capability.
That's the oldest Intel server chipset supported with EM64T CPUs.

The threshold counters are not supported directly only.


> The current k8
> code has been delayed for this reason.
>
> Where the EDAC code goes beyond the current k8 facilities is the
> decode to the dimm level so that the bad memory stick can be
> easily identified.

That would be nice to have agreed. But I don't really know
how to do this without mainboard specific knowledge.
If you have something usable it's best to port it to mce.c
or perhaps mcelog

>
> One of the goals of the EDAC code is to work towards a unified
> kernel architecture for this kind error reporting. Currently every
> architecture (if the error are reported at all) handles this
> differently which makes it very hard to do something sane is
> user space.

There is a clear case for being architecture specific here. Some
architectures - like PPC64 or IA64 - have good firmware support for it, so it's
best to use these facilities. On others like i386 and
x86-64 the x86-64 log architecture is good. I might be a bit
biased but I think it's very good and should be used on i386
at some point too. I don't see any need for more.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/