Re: [RCF] Linux memory error handling

From: Wang, Zhenyu
Date: Wed Jun 15 2005 - 22:02:36 EST


On 2005.06.15 09:30:13 +0000, Russ Anderson wrote:
> [RCF] Linux memory error handling.
>
> Summary: One of the most common hardware failures in a computer
> is a memory failure. There has been efforts in various
> architectures to support recover from memory errors. This
> is an attempt to define a common support infrastructure
> in Linux to support memory error handling.
>
> Background: There has been considerable work on recovering from
> Machine Check Aborts (MCAs) in arch/ia64. One result is
> that many memory errors encountered by user applications
> not longer cause a kernel panic. The application is
> terminated, but linux and other applications keep running.
> Additional improvements are becoming dependent on mainline
> linux support. That requires involvement of lkml, not
> just linux-ia64.

Good RFC! Actually on x86 arch, 'bluesmoke' - http://bluesmoke.sf.net - is out
there for some simple mem ECC error handling already. It's inspired by the old linux-ecc
project. Current capability is limited to detect, report, configuable for polling and UE
panic.

Bluesmoke contains a driver core which is used to host infos for each mem
controller, like dimm info, and currently only polling method is taken for registered
controller. Others are all the specific chipset drivers, which is mostly platform depend,
e.g e7520, 82875P, etc. Those platforms have also been tested, bluesmoke's webpage
contains some test method if you really want to try.

nmi handling is still under work, Dave and Corey's patch is on sourceforge page, and

http://lkml.org/lkml/2004/8/19/140
http://lkml.org/lkml/2005/3/21/11

Those nmi callbacks have not been added to chipset driver yet, but some initial
testing failed, still don't know why...

thanks
-zhen
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/