Re: NMI errors in 2.0.30??

Leonard N. Zubkoff (lnz@dandelion.com)
Sun, 27 Apr 1997 09:56:31 -0700


Date: Sat, 26 Apr 1997 21:40:44 -0400 (EDT)
From: "Richard B. Johnson" <root@analogic.com>

The kernel doesn't "know" anything about an ECC mode in the BIOS. The
kernel presumes that all RAM found is good and whatever is written to
the RAM can be read back exactly as written.

Given that, a memory controller chip may detect a RAM parity error or,
if it handles ECC, an error that it is unable to correct. When it detects
such an error, it signals the CPU via the non-maskable interrupt (NMI).
Since the CPU cannot do anything about a RAM error that has already
occurred, what happens next is up to the software handling the interrupt.
Windoze 95 issues an "unrecoverable error" message and prompts
the user to "Continue or Reboot". NT just presumes the user is dumb and
reboots. Linux knows that there isn't anything it can do about the
problem and just issues an error message and continues. MS-DOS just
ignores the problem unless a memory manager is installed. If a memory
manager is installed, it clears the screen, prints some dumb message
about "protecting you", then waits for a keypress before it reboots the
system.
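
As a rough illustration of the "issue a message and continue" policy, here
is a minimal sketch in C. It assumes the conventional PC/AT layout of system
control port B (I/O port 0x61), where bit 7 reports a memory parity error and
bit 6 an I/O channel check. The names classify_nmi and nmi_message are
hypothetical, and the status byte is passed in as an argument rather than
read from hardware, so this is not the code of any actual kernel.

```c
#include <stdint.h>

/* Assumed PC/AT system control port B (0x61) status bits. */
#define NMI_REASON_PARITY 0x80u  /* bit 7: RAM parity error */
#define NMI_REASON_IOCHK  0x40u  /* bit 6: I/O channel check */

enum nmi_cause { NMI_UNKNOWN, NMI_MEM_PARITY, NMI_IO_CHECK };

/* Hypothetical helper: classify an NMI from a status byte that the
 * caller has already read from port 0x61. Parity takes precedence. */
static enum nmi_cause classify_nmi(uint8_t reason)
{
    if (reason & NMI_REASON_PARITY)
        return NMI_MEM_PARITY;
    if (reason & NMI_REASON_IOCHK)
        return NMI_IO_CHECK;
    return NMI_UNKNOWN;
}

/* Hypothetical helper: the text a "report and continue" policy
 * might log for each cause. The wording is made up for this sketch. */
static const char *nmi_message(enum nmi_cause cause)
{
    switch (cause) {
    case NMI_MEM_PARITY: return "NMI: memory parity error, continuing";
    case NMI_IO_CHECK:   return "NMI: I/O channel check, continuing";
    default:             return "NMI: unknown reason, continuing";
    }
}
```

A real handler would read the port itself and would also have to clear and
re-enable the error latch, which is left out here.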

In every case, there isn't really anything that the operating system
can do to "recover" from a RAM error. In some machines like VAXen,
the kernel will map out any bad RAM found. The task that was using
this RAM gets killed, but the system continues. This area of RAM
will not be reused until the system is rebooted. VAXen use 512-byte
pages.
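
The "map out the bad page and kill its owner" idea can be sketched with a
tiny page-frame table. Everything here (frames, map_out_bad_page,
alloc_frame, the 16-frame pool) is hypothetical and only illustrates the
bookkeeping; it is not how VMS or any other kernel actually implements it.

```c
#include <stdint.h>

#define PAGE_SIZE 512u  /* VAX page size, as mentioned above */
#define NPAGES    16u   /* tiny pool, for illustration only */

struct page_frame {
    int owner;  /* task id using the frame, or -1 if free */
    int bad;    /* nonzero once the frame has been mapped out */
};

static struct page_frame frames[NPAGES];

/* Mark every frame free and good. */
static void frames_init(void)
{
    for (unsigned i = 0; i < NPAGES; i++) {
        frames[i].owner = -1;
        frames[i].bad = 0;
    }
}

/* Hand out the first free frame that has not been mapped out.
 * Returns the frame number, or -1 if none is available. */
static int alloc_frame(int task)
{
    for (unsigned i = 0; i < NPAGES; i++) {
        if (!frames[i].bad && frames[i].owner < 0) {
            frames[i].owner = task;
            return (int)i;
        }
    }
    return -1;
}

/* Map out the frame containing phys_addr. Returns the id of the task
 * that was using it (the one to kill), or -1 if there was none. The
 * frame stays unusable until frames_init() runs again ("reboot"). */
static int map_out_bad_page(uint32_t phys_addr)
{
    uint32_t pfn = phys_addr / PAGE_SIZE;
    if (pfn >= NPAGES)
        return -1;
    int victim = frames[pfn].owner;
    frames[pfn].bad = 1;
    frames[pfn].owner = -1;
    return victim;
}
```

With small pages, mapping out one frame costs very little memory, which is
part of why this approach is practical.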

I believe that if you look into the Pentium and Pentium Pro Machine Check
Architecture and the documentation for the 430HX and 440FX chipsets, you'll
find that there is support for determining the addresses where parity errors
and ECC errors occur. There are ways the system can recover if the memory
that's suspect is holding "clean" data, for example a cached copy of disk data.
In that case, re-reading the data from disk is an option.
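
A minimal sketch of that recovery decision, assuming the kernel knows the
failing address and can find the cached block that lives there. The
structure and the names cached_block, fake_disk_read, and recover_block are
hypothetical, and the "disk" is faked so the example stands alone.

```c
#include <string.h>

#define BLOCK_SIZE 16  /* tiny block, for illustration only */

enum recovery { RECOVERED, UNRECOVERABLE };

struct cached_block {
    int dirty;    /* nonzero if memory holds changes not yet on disk */
    int blockno;  /* which disk block this buffer caches */
    unsigned char data[BLOCK_SIZE];
};

/* Hypothetical stand-in for a disk read: fills the buffer with a
 * pattern derived from the block number. */
static void fake_disk_read(int blockno, unsigned char *buf)
{
    memset(buf, blockno & 0xff, BLOCK_SIZE);
}

/* If the damaged buffer is clean, the disk still holds a good copy,
 * so re-read it. If it is dirty, the only copy was in the bad RAM. */
static enum recovery recover_block(struct cached_block *b)
{
    if (b->dirty)
        return UNRECOVERABLE;
    fake_disk_read(b->blockno, b->data);
    return RECOVERED;
}
```

In practice the fresh copy would presumably go into a different page, since
the original one is now suspect.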

Leonard