Re: NMI trap and 2.1.37-7

Gabriel Paubert (paubert@iram.es)
Wed, 4 Jun 1997 01:58:34 +0200 (METDST)


On 14 May 1997, Linus Torvalds wrote:

> As such I wasn't really interested in the memory parity NMI at all,
> especially as that NMI is useless anyway (you can't get any information
> on what part of memory is bad).
>
> If people are really interested in memory parity checking, they should
> look into the Machine Check Architecture supported in newer CPU's and
> use ECC RAM. The Machine Check Architecture (MCA - I just bet intel did
> that just to mess with the minds of people who have a MCA bus) is a lot
> more useful when it comes to memory parity errors than the NMI line ever
> was.

Sorry to be so late in answering; I've been very busy with other things
while collecting some data about chipsets.

Using the machine check would be elegant if it could ever work. But
besides the fact that machine checks may not be restartable, which is
unacceptable for recoverable ECC errors, the truth is that Intel chipsets
are designed in a way that prevents using it effectively. Basically, the
machine check exception will only happen if there is a parity and/or ECC
error during transfers between the chipset and the processor, in which
case you should truly worry...

Only the 450GX supports this check, because all other chipsets lack the
additional pins on the data bus between the processor and the memory
controller. So it's better to disable machine checks altogether...

Now, take as an example the possible policy for ECC errors with the 440FX
PPro chipset. You can get an NMI for a correctable and/or uncorrectable
ECC error, but the only way to correct the error is to read and rewrite
in place the whole memory (!), which will take several seconds on a large
system. So you would have to split the work up, scanning for example a
few pages on every timer interrupt.
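
The "few pages on every timer interrupt" idea could look roughly like the
sketch below. This is only an illustration in plain C, not kernel code:
struct scrubber, scrub_tick() and the per-tick budget are made-up names
and values, and a real implementation would need the read/write pair to
be atomic with respect to other CPUs and DMA.

```c
#include <stddef.h>
#include <stdint.h>

#define SCRUB_PAGE_SIZE 4096u
#define SCRUB_PAGES_PER_TICK 4u   /* hypothetical budget per timer tick */
#define SCRUB_WORDS_PER_PAGE (SCRUB_PAGE_SIZE / sizeof(uint32_t))

/* Hypothetical bookkeeping for one region being scrubbed. */
struct scrubber {
    volatile uint32_t *base;  /* start of the region */
    size_t total_pages;       /* region size in pages */
    size_t next_page;         /* where the next slice starts */
    size_t passes;            /* completed sweeps of the whole region */
};

/* Read every word of a page and write it back unchanged. The idea is
 * that on ECC hardware the read returns corrected data and the write
 * stores it again with fresh check bits. */
static void scrub_page(volatile uint32_t *page)
{
    for (size_t i = 0; i < SCRUB_WORDS_PER_PAGE; i++) {
        uint32_t v = page[i];
        page[i] = v;
    }
}

/* Meant to be called once per timer tick: scrubs a bounded number of
 * pages, then remembers where to resume. Returns pages scrubbed. */
static size_t scrub_tick(struct scrubber *s)
{
    size_t done = 0;

    while (done < SCRUB_PAGES_PER_TICK && s->total_pages > 0) {
        scrub_page(s->base + s->next_page * SCRUB_WORDS_PER_PAGE);
        done++;
        if (++s->next_page == s->total_pages) {
            s->next_page = 0;   /* wrap around and start a new sweep */
            s->passes++;
        }
    }
    return done;
}
```

The budget trades latency for coverage: a smaller slice costs less per
tick but takes longer to sweep the whole memory once.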

The situation is somewhat better with the 450GX/KX, which has registers
(in the PCI configuration space) that record the address of the failing
memory (not exactly, because of possible memory gaps). And if properly
wired, the chipset can generate a normal interrupt to the APIC for
recoverable errors (there is a dedicated pin for it).
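
One possible reading of the memory gap caveat is that the recorded value
counts populated DRAM only, so it has to be shifted past any holes in the
physical address map before it names a real address. The sketch below
shows that translation under this assumption. It is not taken from the
chipset documentation; struct mem_gap, dram_offset_to_phys() and the gap
layout in the usage note are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one hole in the physical address map. */
struct mem_gap {
    uint32_t start;  /* first physical address covered by the hole */
    uint32_t size;   /* length of the hole in bytes */
};

/* Translate an offset into populated DRAM to a physical address.
 * Assumes gaps[] is sorted by ascending start and does not overlap.
 * Each hole at or below the running address pushes it up by the
 * hole's size. */
static uint32_t dram_offset_to_phys(uint32_t offset,
                                    const struct mem_gap *gaps,
                                    size_t ngaps)
{
    uint32_t phys = offset;

    for (size_t i = 0; i < ngaps; i++) {
        if (phys >= gaps[i].start)
            phys += gaps[i].size;
    }
    return phys;
}
```

With a single made-up hole from 0xA0000 to 0xFFFFF, an offset just below
0xA0000 is returned unchanged, while offset 0xA0000 lands at 0x100000.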

Cheers,
Gabriel.