Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

From: huang ying
Date: Fri May 13 2011 - 09:17:19 EST


Hi, Don,

On Fri, May 13, 2011 at 8:45 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Fri, May 13, 2011 at 04:23:38PM +0800, Huang Ying wrote:
>> In general, unknown NMI is used by hardware and firmware to notify
>> fatal hardware errors to OS. So the Linux should treat unknown NMI as
>> hardware error and go panic upon unknown NMI for better error
>> containment.
>
> I have a couple of concerns about this patch. One I don't think BIOSes
> are ready for this. I have Intel Westmere boxes that say they have a
> valid HEST, GHES, and EINJ table, but when I inject an error there is no
> GHES record. This leaves me with an unknown NMI and panic. Yeah, it is a
> BIOS bug I guess, but I think vendors are going to be slow fixing all this
> stuff (my Nehalem box is in even worse shape with this stuff).

Although there is no GHES record, I think the Westmere box's behavior is
acceptable: the BIOS uses an unknown NMI to notify the OS of a hardware
error, and that is exactly the case this patch is meant to handle.

> Also, is there any known issues with x86_64 platforms with bad NMIs? RHEL
> has had unknown NMI's panic on x86_64 since x86_64 first came out, I don't
> recall any exceptions we had to add to handle 'quirky' hardware.
>
> Then for the i686 case, because the 'quirky' hardware is so old, can't we
> just leave it a kernel config option to switch between using a 'printk'
> vs. a 'panic'? Or even a kernel command line option.
>
> I figure these 'quirky' hardware machines are more the exception nowdays,
> do we really need to add code to whitelist machines?
>
> Granted I am not familiar enough with the quirky hardware (in fact I don't
> think I have seen any mainly because I haven't been around long enough).
> Most cases I see when trolling through the fedora bugzilla list for
> unknown NMIs, is just bad firmware or acpi power configurations.
>
> Just wondering if we could just simplify the patch somehow with better
> assumptions.

So there are still unknown NMIs on real hardware today. I am afraid that
turning on panic on unknown NMI by default may not be acceptable to
some users.

Best Regards,
Huang Ying