Re: [RFC] x86, NMI, Treat unknown NMI as hardware error

From: Ingo Molnar
Date: Tue May 17 2011 - 04:50:57 EST



* Don Zickus <dzickus@xxxxxxxxxx> wrote:

> On Mon, May 16, 2011 at 01:29:34PM +0200, Ingo Molnar wrote:
> > > Interesting. Question though: what do you mean by 'event filtering'? Is
> > > that different than setting 'unknown_nmi_panic' on the command line or
> > > in procfs?
> > >
> > > Or are you suggesting something like registering another callback on the
> > > die_chain that looks for DIE_NMIUNKNOWN as the event, swallows them and
> > > implements the policy? That way only on HEST related platforms would
> > > register them while others would keep the default of 'Dazed and confused'
> > > messages?
> >
> > The idea is that "event filters", which are an existing upstream feature and
> > which can be used in rather flexible ways:
> >
> > http://lkml.org/lkml/2011/4/27/660
> >
> > Could be used to trigger non-standard policy action as well - such as to panic
> > the box.
> >
> > This would replace various very limited /debugfs and /sys event filtering hacks
> > (and hardcoded policies) such as arch/x86/kernel/cpu/mcheck/mce-severity.c, and
> > it would allow nonstandard behavior like 'panic the box on unknown NMIs' as
> > well.
> >
> > This could be set by the RAS daemon, and it could be propagated to the kernel
> > boot line as well, where event filter syntax would look like this:
> >
> > events=nmi::unknown"if (reason == 0) panic();"
>
> Wow. ok. I believe that is the most complicated kernel boot param I have
> ever seen. :-) Powerful, no doubt.

Normally it would not have to be typed at all - the defaults would still be sane.

> So this would sorta be a meta-notifier? I guess you are saying platforms
> that implement something like HEST could setup an event like that to trigger
> the behaviour they want on a per-platform basis?

Yeah - or if they dislike the default they could tweak the policy action in a
rather flexible way.
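
To make the "default plus optional policy filter" idea concrete, here is a
rough userspace sketch. It is not kernel code and not the event filter
implementation: every sketch_* name and the ACT_* actions are made up for
illustration, and the chain only loosely mimics how die_chain callbacks can
swallow a DIE_NMIUNKNOWN event. The filter mirrors the
"if (reason == 0) panic();" example above, and registration is printed so
it stays visible who is hooked in.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for notifier return codes (not the kernel's). */
enum { SKETCH_DONE = 0, SKETCH_STOP = 1 };

/* Hypothetical policy actions a filter may request. */
enum sketch_action { ACT_DEFAULT = 0, ACT_PANIC = 1 };

struct sketch_nmi_event {
	unsigned char reason;
};

typedef int (*sketch_filter_fn)(const struct sketch_nmi_event *ev,
				enum sketch_action *act);

struct sketch_filter {
	const char *name;
	sketch_filter_fn fn;
};

#define SKETCH_MAX_FILTERS 8

static struct sketch_filter sketch_chain[SKETCH_MAX_FILTERS];
static int sketch_chain_len;

/* Register a filter and log it, so hooked-in users are easy to find. */
static void sketch_register(const char *name, sketch_filter_fn fn)
{
	assert(sketch_chain_len < SKETCH_MAX_FILTERS);
	sketch_chain[sketch_chain_len].name = name;
	sketch_chain[sketch_chain_len].fn = fn;
	sketch_chain_len++;
	printf("nmi-sketch: filter '%s' registered\n", name);
}

/* Example platform policy: request a panic when reason == 0. */
static int sketch_hest_filter(const struct sketch_nmi_event *ev,
			      enum sketch_action *act)
{
	if (ev->reason == 0) {
		*act = ACT_PANIC;
		return SKETCH_STOP;	/* swallow the event */
	}
	return SKETCH_DONE;		/* not ours, keep walking the chain */
}

/* Walk the chain; fall back to the default message if nobody claims it. */
static enum sketch_action sketch_dispatch_unknown_nmi(unsigned char reason)
{
	struct sketch_nmi_event ev = { reason };
	int i;

	for (i = 0; i < sketch_chain_len; i++) {
		enum sketch_action act = ACT_DEFAULT;

		if (sketch_chain[i].fn(&ev, &act) == SKETCH_STOP)
			return act;
	}
	printf("Dazed and confused, but trying to continue\n");
	return ACT_DEFAULT;
}
```

With no filter registered, every unknown NMI gets the default treatment; once
a platform registers sketch_hest_filter, only events with reason == 0 turn
into a panic request and everything else still falls through to the default.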

> My only argument against it would be sorta what Ying complains about: you
> start to lose track of who is hooked into the NMI. It is one thing to search
> for all the users of the die_notifier to track down who is swallowing NMIs.
> But looking for event users is going to be harder. Unless the events
> processing has a switch to turn on logging? :-)

Yeah, all such types of filters should be printed during bootup, to make it
really clear what is happening.

We also want all the current state to be readily visible under /sys/events or
/events.

Thanks,

Ingo