Re: NMI error and Intel S5000PSL Motherboards

From: AndrewL733
Date: Fri Sep 28 2007 - 11:22:20 EST


rdunlap@xxxxxxxxxxxx wrote:
On Wed, 26 Sep 2007 19:48:14 -0400 Jim Paris wrote:

Hello,

We have about 100 servers based on Intel S5000PSL-SATA motherboards. They have been running for anywhere between 1 and 10 months. For the past few months, after updating them all to the 2.6.20.15 kernel (because of a bug in the 2.6.18 kernel), we are seeing some strange NMI errors. For example:

Aug 29 09:02:10 master kernel: Uhhuh. NMI received for unknown reason 30.
Aug 29 09:02:10 master kernel: Do you have a strange power saving mode enabled?
Aug 29 09:02:10 master kernel: Dazed and confused, but trying to continue
I'm also working with Andrew and Samson. It seems that the cause of
the problem is CONFIG_PCIEAER, which was introduced after 2.6.18 and
defaults to y.

With CONFIG_PCIEAER=n, scanpci works fine with no errors. This is the
workaround that they'll likely use for now.

Glad that you found it.
With CONFIG_PCIEAER=y, scanpci always triggers the NMI error. The
option aerdriver.forceload=1 has no effect.

So, looking for some closure here, what do we think is the "root cause"? Is it:

1) a defect with Intel's S5000PSL motherboards that is exposed by an otherwise fine new (since 2.6.19) Linux kernel feature? (in which case we and others should probably press Intel to recognize they have a problem, seeing as they only "officially support" distributions running on 2.6.16 or below so maybe they don't even know about this issue).

2) a problem with PCIEAER? And maybe "CONFIG_PCIEAER=y" should NOT be the default setting? (in which case the kernel maybe needs fixing)

3) just a bad interaction between a good motherboard and a good Linux feature that don't play well together? (in which case this is a kernel "feature" that anybody compiling a kernel to run on the Intel S5000PSL motherboard should know not to enable -- maybe a note is warranted so that when configuring the kernel, people with S5000PSL motherboards might not make the same mistake???).


The 'forceload' option only forces the driver to load even when the
ACPI hardware initialization routine fails.

It would be nice to be able to disable PCIEAER at boot time though.
Shouldn't be difficult.


The related dmesg output at boot is:

Evaluate _OSC Set fails. Status = 0x0005
Evaluate _OSC Set fails. Status = 0x0005
aer_init: AER service init fails - Run ACPI _OSC fails
aer: probe of 0000:00:02.0:pcie01 failed with error 2
aer_init: AER service init fails - No ACPI _OSC support
aer: probe of 0000:00:03.0:pcie01 failed with error 1
Evaluate _OSC Set fails. Status = 0x0005
Evaluate _OSC Set fails. Status = 0x0005
aer_init: AER service init fails - Run ACPI _OSC fails
aer: probe of 0000:00:04.0:pcie01 failed with error 2
Evaluate _OSC Set fails. Status = 0x0005
Evaluate _OSC Set fails. Status = 0x0005
aer_init: AER service init fails - Run ACPI _OSC fails
aer: probe of 0000:00:05.0:pcie01 failed with error 2
Evaluate _OSC Set fails. Status = 0x0005
Evaluate _OSC Set fails. Status = 0x0005
aer_init: AER service init fails - Run ACPI _OSC fails
aer: probe of 0000:00:06.0:pcie01 failed with error 2
aer_init: AER service init fails - No ACPI _OSC support
aer: probe of 0000:00:07.0:pcie01 failed with error 1

Full dmesg, lspci, and ACPI DSDT are available here:
http://jim.sh/~jim/tmp/nmi/

-jim


---
~Randy
Phaedrus says that Quality is about caring.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/