Re: Linux & ECC memory

Leonard N. Zubkoff (lnz@dandelion.com)
Thu, 14 Nov 1996 17:46:30 -0800


Date: Thu, 14 Nov 1996 19:59:36 -0500 (EST)
From: Kenneth Albanowski <kjahds@kjahds.com>

On Thu, 14 Nov 1996, Steve VanDevender wrote:

> > This is what I'm curious about. Does Linux's NMI code attempt to work
> > around some memory problems, or does it just panic?
> It's necessary to have access to the additional bits used for ECC in
> order to attempt correction in software. I don't know of any systems
> that let you have access to the parity bit on a byte at the software
> level. ECC needs at least three extra bits per word to correct
> single-bit and detect double-bit errors.
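
As a rough check on that lower bound, here is a minimal sketch that assumes a plain extended Hamming (SEC-DED) code, which may not be what any given memory controller actually uses. The function name secded_check_bits is made up for illustration. It counts the check bits needed for a given data width; three is the figure for a one-bit word, and realistic word sizes need more.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for an extended Hamming (SEC-DED) code over data_bits.

    Assumption: a textbook Hamming code plus one overall parity bit,
    not the scheme of any particular memory controller.
    """
    if data_bits < 1:
        raise ValueError("data_bits must be positive")
    r = 1
    # Single-error correction needs 2**r >= data_bits + r + 1, so that every
    # bit position, plus the no-error case, gets a distinct syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit separates single-bit from double-bit errors.
    return r + 1


for width in (1, 8, 32, 64):
    print(width, secded_check_bits(width))
```

Under that assumption a 64-bit word needs 8 check bits, which matches the common 72-bit-wide ECC memory module.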

That wasn't what I meant. (Sorry, I really should try to be clearer.)

I realize Linux itself cannot (and should not) do any sort of error
correction or detection. That is the role of the hardware. But once the
ECC hardware has been triggered, the failing bit(s) will be corrected, or
an NMI (good old "parity error") will be generated indicating a memory
fault.

Once the NMI has occurred, can Linux attempt to localize the memory fault
and work around it, at the very least by trying to page the affected page
back in?

Albert Calahan just sent me some mail saying the hardware doesn't report
the failed memory location when the NMI is triggered, so that would answer
my question -- Linux can't attempt to ameliorate an error, as it doesn't
know where it happened.

A more subtle issue is whether the ECC memory controller could report
instances where ECC detection and successful correction took place. It
would seem to be useful to provide a way for the OS to recognize that
non-fatal memory errors have occurred, even though they were repaired.

I believe the Machine Check Architecture implemented in the P6 and P5 CPUs is
what needs to be looked into. If the Machine Check Exception is enabled,
information is placed into special registers detailing memory errors that have
occurred.
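
To give a feel for what those registers hold, here is a minimal sketch that decodes the flag bits of a 64-bit machine-check bank status value. The bit positions are the ones I recall from Intel's documentation for the P6-family bank status registers and should be checked against the manual. The names MachineCheckStatus and decode_mci_status are made up for illustration, and actually reading the register requires RDMSR in ring 0, which is not shown.

```python
from typing import NamedTuple


class MachineCheckStatus(NamedTuple):
    valid: bool             # bit 63: register holds a logged error
    overflow: bool          # bit 62: an earlier error was overwritten
    uncorrected: bool       # bit 61: hardware could not correct the error
    enabled: bool           # bit 60: reporting was enabled for this error
    misc_valid: bool        # bit 59: companion MISC register is valid
    addr_valid: bool        # bit 58: companion ADDR register is valid
    context_corrupt: bool   # bit 57: processor context may be corrupt
    model_specific_code: int  # bits 31:16
    mca_error_code: int       # bits 15:0


def decode_mci_status(raw: int) -> MachineCheckStatus:
    """Split a raw 64-bit status value into its fields (assumed layout)."""
    def bit(n: int) -> bool:
        return bool((raw >> n) & 1)

    return MachineCheckStatus(
        valid=bit(63),
        overflow=bit(62),
        uncorrected=bit(61),
        enabled=bit(60),
        misc_valid=bit(59),
        addr_valid=bit(58),
        context_corrupt=bit(57),
        model_specific_code=(raw >> 16) & 0xFFFF,
        mca_error_code=raw & 0xFFFF,
    )


# Hypothetical value, not captured from real hardware: a valid, corrected
# error with a recorded address.
sample = (1 << 63) | (1 << 60) | (1 << 58) | 0x0101
status = decode_mci_status(sample)
print(status)
```

A status with valid set and uncorrected clear is the interesting case for this thread: the hardware repaired the error, and if addr_valid is also set the OS could learn where it happened.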

Leonard