Re: [PATCH] Add mem_nmi_panic enable system to panic on hard error

From: Alan Cox
Date: Tue Dec 13 2005 - 07:23:02 EST


On Maw, 2005-12-13 at 01:48 -0500, Dave Jones wrote:
> On Tue, Dec 13, 2005 at 03:38:16PM +0900, Hidetoshi Seto wrote:
> > Some x86 server fires NMI with reason 0x80 up if a parity error
> > occurs on its PCI-bus system or DIMM module.
>
> Hmm, are you sure this isn't a bios error misconfiguring
> some northbridge register perhaps ? Some chipsets offer
> such reporting as a feature. Could be your server has this
> on by default.

This is done deliberately on some systems and is why we have the patches
I submitted to Andrew Morton which are in his tree and allow you to set
"panic_on_unrecovered_nmi" to halt on a memory error.

See the -mm tree. The functionality is there. Also see the various
x86-64 logging tools which can parse out MCE based reports.

> (I believe the EDAC code has also triggered similar cases
> on certain cards which is why it too disables this checking
> by default).

EDAC has it enabled by default at the moment, although it needs to clear
left over flags and possibly a small blacklist first.

> The sysctl seems pointless too. If this is needed at all,
> why would you ever want to turn it off ?

Debugging and stressing systems for one.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/