Re: Linux & ECC memory

Albert Cahalan (albert@ccs.neu.edu)
Thu, 14 Nov 1996 22:47:47 -0500 (EST)


>>> Albert Cahalan just sent me some mail saying the hardware
>>> doesn't report the failed memory location when the NMI is
>>> triggered, so that would answer my question -- Linux can't
>>> attempt to ameliorate an error, as it doesn't know where
>>> it happened.
>
> Wouldn't Linux know which process was active (and generated
> the NMI), though? I would think that the kernel could at least
> kill the process and unmap the physical pages used by that
> process at the time.
>
> This would a) keep the system running and b) provide at least
> some indication of where the error is (although memtest86 will
> be more useful in this regard)

The kernel could record what pages are in use by the current
process, plus the previous process if the current process was
just scheduled. It is best to just printk() the address space.

After several NMIs have happened, the sysadmin can use a
statistical tool to examine the log file for patterns.
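
A minimal sketch of such a tool, in Python. The log line format
("NMI: pid <n> pages <hex page numbers>") is purely an assumption
for illustration, since the kernel prints nothing like it today,
and the names NMI_LINE and suspect_pages are made up. It counts
how many NMI events each page appears in and ranks the pages that
show up repeatedly.

```python
import re
from collections import Counter

# Hypothetical log format, assumed for this sketch only, e.g.
#   "NMI: pid 42 pages 0x1a2 0x1a3 0x2ff"
NMI_LINE = re.compile(r"NMI: pid (\d+) pages((?: 0x[0-9a-fA-F]+)+)")


def suspect_pages(log_lines, min_events=2):
    """Return (page, hits) pairs for pages seen in at least min_events NMIs.

    Results are sorted by hit count (highest first), then by page number.
    """
    hits = Counter()
    for line in log_lines:
        m = NMI_LINE.search(line)
        if not m:
            continue  # not an NMI report line
        # Use a set so a page counts at most once per NMI event.
        pages = {int(p, 16) for p in m.group(2).split()}
        hits.update(pages)
    ranked = [(page, n) for page, n in hits.items() if n >= min_events]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python nmi_pages.py < /var/log/messages
    for page, n in suspect_pages(sys.stdin):
        print(f"page {page:#x}: seen in {n} NMI events")
```

A page that was merely in use by an unlucky process will appear
once or twice; a page that keeps turning up across unrelated
processes is the one worth avoiding. The threshold min_events is
an arbitrary starting point, not a tested value.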

If a pattern shows up, then some of the worst pages can be
avoided. It would be easy to add a command line option to
avoid bad pages, but it is better if there is no need to reboot.
Then klogd could automate the whole solution.

>> I believe the Machine Check Architecture implemented in the P6
>> and P5 CPUs is what needs to be looked into. If the Machine
>> Check Exception is enabled, information is placed into special
>> registers detailing memory errors that have occurred.
>
> Of course, this would be even better :-)

I will guess: That information is in appendix H, and it has
not been reverse engineered for www.x86.org yet.