Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

From: huang ying
Date: Fri May 13 2011 - 20:20:46 EST


On Fri, May 13, 2011 at 9:51 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Fri, May 13, 2011 at 09:17:13PM +0800, huang ying wrote:
>> Hi, Don,
>>
>> On Fri, May 13, 2011 at 8:45 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
>> > On Fri, May 13, 2011 at 04:23:38PM +0800, Huang Ying wrote:
>> >> In general, unknown NMI is used by hardware and firmware to notify
>> >> the OS of fatal hardware errors. So Linux should treat unknown NMI as
>> >> a hardware error and panic upon unknown NMI for better error
>> >> containment.
>> >
>> > I have a couple of concerns about this patch. One I don't think BIOSes
>> > are ready for this. I have Intel Westmere boxes that say they have a
>> > valid HEST, GHES, and EINJ table, but when I inject an error there is no
>> > GHES record. This leaves me with an unknown NMI and panic. Yeah, it is a
>> > BIOS bug I guess, but I think vendors are going to be slow fixing all this
>> > stuff (my Nehalem box is in even worse shape with this stuff).
>>
>> Although there is no GHES record, I think the Westmere box behavior is
>> acceptable: an unknown NMI is used by the BIOS to notify a hardware
>> error, which is what we want to deal with in this patch.
>
> I don't think having HEST changes the situation. I agree with your
> statement above, but I can also generate unknown NMIs from stressing perf.

Yes, perf can still generate unknown NMIs. Maybe we should turn off the
panic-on-unknown-NMI logic while perf is running, and warn users that
using perf may cost them some RAS features.
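
A rough sketch of that idea, written as a user-space model rather than
kernel code. Every name in it (unknown_nmi_state, perf_nmi_get,
perf_nmi_put, unknown_nmi_decide) is hypothetical and made up for
illustration; none of them is an existing kernel interface, and a real
implementation would have to be NMI-safe and handle per-CPU state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical policy outcomes for an unknown NMI. */
enum unknown_nmi_action {
	UNKNOWN_NMI_WARN,	/* log the NMI and continue */
	UNKNOWN_NMI_PANIC,	/* treat it as a fatal hardware error */
};

/* Hypothetical state; a real kernel would need atomics here. */
struct unknown_nmi_state {
	bool panic_enabled;	/* panic on unknown NMI was requested */
	int perf_users;		/* active perf sessions that may raise NMIs */
	bool warned;		/* RAS warning already printed once */
};

/* Called when a perf session starts: warn once that panic is suspended. */
void perf_nmi_get(struct unknown_nmi_state *s)
{
	s->perf_users++;
	if (s->panic_enabled && !s->warned) {
		fprintf(stderr,
			"perf active: unknown NMI will not panic, "
			"some RAS features are lost\n");
		s->warned = true;
	}
}

/* Called when a perf session ends; get/put calls must be balanced. */
void perf_nmi_put(struct unknown_nmi_state *s)
{
	assert(s->perf_users > 0);
	s->perf_users--;
}

/* Panic only if it was requested and no perf session is active. */
enum unknown_nmi_action unknown_nmi_decide(const struct unknown_nmi_state *s)
{
	if (s->panic_enabled && s->perf_users == 0)
		return UNKNOWN_NMI_PANIC;
	return UNKNOWN_NMI_WARN;
}
```

The point of the counter is that panic comes back automatically once the
last perf user goes away, so the RAS behavior is only weakened for as
long as perf is actually in use.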

> Broken hardware usually generated NMIs, sometimes they propagated to the
> cpu, other times, they were swallowed by the chipset. Which means having
> HEST or not having HEST doesn't improve anything nor make it any worse.
>
> IOW I don't think we gain anything with this patch.

Without this patch, a real fatal hardware error may silently ruin your
disk data. But with this patch, you can panic before that. I think
this is what we gain with this patch.

>>
>> > Also, are there any known issues with x86_64 platforms with bad NMIs? RHEL
>> > has had unknown NMI's panic on x86_64 since x86_64 first came out, I don't
>> > recall any exceptions we had to add to handle 'quirky' hardware.
>> >
>> > Then for the i686 case, because the 'quirky' hardware is so old, can't we
>> > just leave it a kernel config option to switch between using a 'printk'
>> > vs. a 'panic'? Or even a kernel command line option.
>> >
>> > I figure these 'quirky' hardware machines are more the exception nowadays,
>> > do we really need to add code to whitelist machines?
>> >
>> > Granted I am not familiar enough with the quirky hardware (in fact I don't
>> > think I have seen any mainly because I haven't been around long enough).
>> > Most cases I see when trolling through the fedora bugzilla list for
>> > unknown NMIs, is just bad firmware or acpi power configurations.
>> >
>> > Just wondering if we could just simplify the patch somehow with better
>> > assumptions.
>>
>> So there are still unknown NMIs on real hardware now. I am afraid that
>> turning on panic on unknown NMI by default may not be acceptable to some.
>
> The opposite could be said too. I think that was Ingo's point. The
> policy should be left in the hands of the user or distro because there is
> no right answer.

IMHO, Linux is not X, so the Linux kernel will not push all policy to
user space. And for fatal hardware error processing, there may be no
opportunity for user space to run.

Best Regards,
Huang Ying