RE: EDAC (was: Re: 2.6.14-rc5-mm1)

From: Roger Heflin
Date: Thu Oct 27 2005 - 11:43:54 EST






> depends on your requirements.
>
> we have been living with systems with PCI devices for a
> decade now. how many times have events occurred that had no
> explaination and are simply dismissed? There were no detectors.
>
> We assume many things, even today. How many desktops with
> gigs of memory have no ECC? I have learned my lesson while
> refactoring bluesmoke/edac that ECC is very important. ECC
> always in my machines for now on, for me anyway.

The bigger question is how many machines have ecc but no one is
watching it for errors of any sort, the answer for this question
is almost all. On opteron's mcelog does a decent job of alerting
people to memory issues, but not even all Enterprise distributions
include mcelog, but on the Xeons there is almost nothing
warning any of ecc failures, unless someone checks the bios event
logs on the few machinces that work, and this is fairly difficult,
so almost no one does.


>
> For PCI devices, if you want to "know" data is being
> transmitted correctly, then there needs to be "detector" and
> "reporter" and "handler" agents of this bad events to
> properly notice, report and process them.

And there are other PCI cards/motherboards that have timing issues
also, I know because I have personally found and debugged a
couple of them, one was nasty enough that it corrupted data on every
5-10GB of reads, and it was *ALL* motherboards of this type with a
certain given card running at 133mhz. With a different manufacturer's
similar card it crashed with a MTBF of about when one would expect corrupted
data.

Roger

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/