Re: [PATCH RFC] NMI Re-introduce un[set]_nmi_callback

From: Don Zickus
Date: Thu Sep 04 2008 - 14:27:46 EST


On Thu, Sep 04, 2008 at 07:52:31PM +0200, Andi Kleen wrote:
> On Thu, Sep 04, 2008 at 01:20:52PM -0400, Don Zickus wrote:
> > On Thu, Sep 04, 2008 at 05:52:17PM +0200, Andi Kleen wrote:
> > > Then if there's a chipset specific NMI driver it could
> > > also check if the chipset raised it. That would be a possible
> > > solution for HP -- they would need to implement such a driver
> > > for their systems with the special watchdog.
> >
> > The thing with HP's special watchdog timer is that it does _not_ have a
> > chipset specific NMI it is trying to catch. HP is going on the assumption
> > that _all_ NMIs are /bad/ and they want to catch _every_ NMI, log it, and
> > reboot the system.
>
> That's my point. If you have drivers which can identify all other
> NMIs then the left over NMIs must come from that watchdog driver.
> So they just need drivers which can do that for their chipsets.

Except their chipsets are _not_ producing NMIs. They just want to
supersede all the other NMI handlers. For example, if an EDAC NMI came in,
they don't want the EDAC handler to try to recover from it; HP just wants
their NMI watchdog to grab the NMI, log it, and reboot.

>
> It's not race free, but that's simply not possible with the x86
> NMI architecture.

I agree.

>
> Better would be probably to just configure the watchdog
> to reboot the system directly on its own. Most other watchdogs
> I'm aware of do that. That's more reliable anyways because the system
> might be wedged enough to not be able to process NMIs anymore.

The trick is that they want to log it in a special way (BIOS or NVRAM or
something, I forget) before rebooting.

>
> >
> > Now obviously NMIs from kgdb and oprofile are not the ones a system should
> > panic on but this breaks HP's assumptions.
> >
> > So that is part of the problem. How do you become a catch-all for NMIs in
> > a system, to process as you wish, but ignore all the 'safe' NMIs?
>
> To be fully reliable: you need a new NMI architecture or move the event
> somewhere else.
> To be reasonably reliable (assuming NMIs are not very frequent): you
> need drivers for all NMI sources that can identify them.

Yeah, I know. Originally I thought this would be easy: just replace the
default handler. But once kgdb and oprofile using NMIs was mentioned,
I realized we are almost back to square one. :-(
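
For reference, here is a rough user-space sketch of the scheme discussed
above: handlers that can identify their own NMIs get first shot, and
whatever is left over goes to a catch-all. This is not kernel code. Every
name in it (register_nmi_handler_sketch, dispatch_nmi_sketch, the SRC_*
flags, the "pending" bitmask) is hypothetical and made up for illustration.
In particular, real x86 hardware does not hand you a bitmask of sources;
each driver would have to probe its own device, which is why the approach
cannot be race free.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated NMI sources (hypothetical bit flags, illustration only). */
#define SRC_KGDB     0x1u
#define SRC_OPROFILE 0x2u

enum nmi_outcome {
    NMI_UNCLAIMED = 0,          /* nobody took it */
    NMI_CLAIMED_BY_DRIVER,      /* an identifying handler took it */
    NMI_CLAIMED_BY_CATCH_ALL    /* left over, taken by the catch-all */
};

/* A handler returns nonzero if the NMI was its own. */
typedef int (*nmi_handler_fn)(unsigned int pending);

#define MAX_NMI_HANDLERS 8
nmi_handler_fn nmi_chain[MAX_NMI_HANDLERS];
size_t nmi_chain_len;
nmi_handler_fn nmi_catch_all;   /* optional; NULL means none installed */
int watchdog_hits;              /* counts NMIs that reached the catch-all */

/* Append a handler that knows how to recognize its own NMIs. */
void register_nmi_handler_sketch(nmi_handler_fn fn)
{
    assert(nmi_chain_len < MAX_NMI_HANDLERS);
    nmi_chain[nmi_chain_len++] = fn;
}

/* Stand-ins for the "safe" NMI users mentioned in the thread. */
int kgdb_nmi_sketch(unsigned int pending)
{
    return (pending & SRC_KGDB) != 0;
}

int oprofile_nmi_sketch(unsigned int pending)
{
    return (pending & SRC_OPROFILE) != 0;
}

/* Stand-in for the watchdog: a real one would log and then reboot. */
int watchdog_catch_all_sketch(unsigned int pending)
{
    (void)pending;
    watchdog_hits++;
    return 1;
}

/* Offer the NMI to each identifying handler; leftovers go to the catch-all. */
enum nmi_outcome dispatch_nmi_sketch(unsigned int pending)
{
    for (size_t i = 0; i < nmi_chain_len; i++) {
        if (nmi_chain[i](pending))
            return NMI_CLAIMED_BY_DRIVER;
    }
    if (nmi_catch_all && nmi_catch_all(pending))
        return NMI_CLAIMED_BY_CATCH_ALL;
    return NMI_UNCLAIMED;
}
```

Note that this models the "identify everything else, the rest is the
watchdog's" ordering. It does not give HP what they asked for in the EDAC
case, where the watchdog would have to run ahead of a handler that can
already identify its NMI.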

Cheers,
Don