Re: [RFC 1/6] x86, NMI, Add symbol definition for NMI magic constants

From: huang ying
Date: Fri Sep 24 2010 - 07:50:22 EST


On Thu, Sep 23, 2010 at 10:16 PM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
> On Thu, Sep 23, 2010 at 05:29:57PM +0800, huang ying wrote:
>> Hi, Don,
>>
>> On Thu, Sep 23, 2010 at 12:07 AM, Don Zickus <dzickus@xxxxxxxxxx> wrote:
>> > On Wed, Sep 22, 2010 at 12:19:16AM +0200, Andi Kleen wrote:
>> >>
>> >> >
>> >> > I guess adding either another knob to override the hardware error option
>> >> > or tying it in with the panic_on_unknown_error option might make me more
>> >> > comfortable. That way enterprise customers can always just enable it by
>> >> > default and desktop users (for now) could have it off.
>> >>
>> >> Anything that needs explicit enabling is a bad idea, that
>> >> would lead to a lot of users running in "corrupt my data" mode.
>> >
>> > I know. But as I said earlier in my emails, I am trying to figure out how
>> > to deal with the fallout from unknown nmis turning into panics. Today
>> > people see unknown nmis. They may or may not be corrupting data. They
>> > usually file a bug. Currently it is hard for me to diagnose the problem.
>> > Usually the old 'upgrade your bios/firmware' does the trick. Sometimes it
>> > doesn't and people feel like their machines still run fine. So they
>> > ignore it (for good or for bad).
>> >
>> > Turning unknown nmis into panics would break their current setup without
>> > much gain. So I was trying to propose something temporarily until we
>> > could get a better infrastructure to isolate the source and provide better
>> > info on what to do.
>> >
>> > I agree with you that long term unknown nmis should be panics. I just get
>> > nervous about doing that now from a support perspective.
>>
>> In fact, we use a white list policy here. Only systems with HEST or
>> identified by chipset host bridge PCI ID will panic on an unknown NMI.
>> So I think the systems you are worried about will not have this enabled.
>>
>> >> The code currently uses the presence of a HEST error table
>> >> to detect a server. HEST should be only available on servers.
>> >>
>> >> On servers at least panic should be default.
>> >
>> > Ok. That's fine. But then what. What does a developer do with that
>> > panic? There's no useful info. That is sorta my problem. Then again I
>> > do not know much about HEST.
>>
>> On some systems, there is a hardware error log in the BMC/BIOS. The
>> hardware error log can be retrieved via IPMI or the BIOS menu. Otherwise,
>> can we get some useful info for an unknown NMI? If we can, can we collect
>> the info, then print it on the console and save it into flash via ERST
>> (part of APEI too) before the panic?
>
> Ok. Does the BIOS/BMC automatically do this? Can we just print a message
> on panic saying to check your BIOS/BMC logs for more info?

Yes, the BIOS/BMC does that automatically. I will add a hint about the
BIOS/BMC logs to the panic message.

> I would love to add code to gather more useful info for unknown NMIs, but
> is it expected that HEST does some of this? I guess what I am trying to
> figure out, if we are going to put intelligence to detect a HEST enabled
> machine and panic when unknown NMI comes along (presumably from HEST??),
> then can we leverage HEST at all to understand why the NMI happened or
> point the user to the BIOS/BMC to get more info. In other words, what
> value do we get from HEST other than we detect it's there, let's panic.

Yes. HEST can be used to report some hardware error information. I am
working on that now.

>> HEST is defined in ACPI spec 4.0 and later, in the section named
>> APEI (ACPI Platform Error Interface). It is used to describe the error
>> sources of the system. It should be available only on server platforms.
>
> Ok. Does the kernel have intelligence to use it or the BIOS yet?

HEST works in a cooperative way between the kernel and the BIOS. I am
working on a HEST driver which will get notified on NMI and collect the
error information reported by the BIOS. But it is possible that some
systems have only a BMC/BIOS log and do not report anything to the OS
except an unknown NMI. The unknown NMI panic logic is for these systems.
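
To make the white list policy concrete, here is a rough user-space
sketch of the decision only. It is not the code from the patch: the
structure, the function names and the PCI IDs in the table are all
hypothetical placeholders chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical platform description; not a real kernel structure. */
struct platform_info {
	bool has_hest;          /* firmware exposes an ACPI HEST table */
	uint16_t bridge_vendor; /* PCI vendor ID of the host bridge */
	uint16_t bridge_device; /* PCI device ID of the host bridge */
};

struct bridge_id {
	uint16_t vendor;
	uint16_t device;
};

/* Placeholder IDs for illustration only; not a real white list. */
static const struct bridge_id bridge_whitelist[] = {
	{ 0x1234, 0x0001 },
	{ 0x1234, 0x0002 },
};

/* Return true if the host bridge ID appears in the white list. */
static bool bridge_is_whitelisted(uint16_t vendor, uint16_t device)
{
	size_t n = sizeof(bridge_whitelist) / sizeof(bridge_whitelist[0]);

	for (size_t i = 0; i < n; i++) {
		if (bridge_whitelist[i].vendor == vendor &&
		    bridge_whitelist[i].device == device)
			return true;
	}
	return false;
}

/*
 * Panic on an unknown NMI only if the platform has HEST or its host
 * bridge is white listed. Every other system keeps today's behavior.
 */
static bool unknown_nmi_should_panic(const struct platform_info *p)
{
	return p->has_hest ||
	       bridge_is_whitelisted(p->bridge_vendor, p->bridge_device);
}
```

With this shape, a desktop without HEST and with an unlisted host bridge
never reaches the panic path, which is the point of using a white list
rather than a global default.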

Best Regards,
Huang Ying