Re: [PATCH] x86, UV: Fix NMI handler for UV platforms

From: Jack Steiner
Date: Wed Mar 23 2011 - 12:33:05 EST

On Tue, Mar 22, 2011 at 06:05:05PM -0400, Don Zickus wrote:
> On Tue, Mar 22, 2011 at 04:25:19PM -0500, Jack Steiner wrote:
> > > > AFAICT, the UV nmi handler is not consuming extra NMI interrupts. I can't
> > > > rule out that I'm missing something but I don't see it.
> > >
> > > What happens if you put the UV nmi handler below the hw_perf handler in
> > > priority? I assume the DIE_NMIUNKNOWN snippet in the hw_perf handler will
> > > swallow some of the UV NMIs, but more importantly does it still generate
> > > the hang you see?
> >
> > I verified that the failures ("perf top" stops) are the same on both RHEL6.1 & the
> > latest x86 2.6.38+ tree.
> Thanks for testing that.
> >
> > I switched priorities & as expected, "perf top" no longer hangs. I see an occasional
> > missed UV NMI - about 1 every minute. I also see a few "dazed" messages as
> > well - 3 in a 5 minute period. This testing was done on a 2.6.38+ kernel.
> >
> > I'm running on a 48p system.
> >
> > Ideas?
>
> Wow, interesting.
>
> First, in 'uv_handle_nmi', can you change that from
> DIE_NMIUNKNOWN back to DIE_NMI? Originally I set it to DIE_NMIUNKNOWN
> because I didn't think you guys had the ability to determine if your BMC
> generated the NMI or not. Recently George B. said you guys added a register
> bit to determine this, so I am wondering if promoting this would fix
> the missed UV NMI. I am speculating this is being swallowed by the
> hw_perf DIE_NMIUNKNOWN exception path.

Correct. I recently added a register that indicates the BMC sent an NMI.

Hmmm. Looks like I have been running with DIE_NMI. I think that came
from porting the patch from RHEL6 to upstream.

However, neither DIE_NMIUNKNOWN nor DIE_NMI gives the desired behavior (2.6.38+).

- Using DIE_NMIUNKNOWN, I see many more "dazed" messages but no
  perf top lockup. I see ~3 "dazed" messages per minute. UV NMIs are
  being sent at a rate of 30/min, i.e. a ~10% failure rate.

- Using DIE_NMI, no "dazed" messages but perf top hangs about once a
  minute (rough estimate).

I wonder if we need a different approach to handling NMIs. Instead of using
the die_notifier list, introduce a new notifier list reserved exclusively
for NMIs. When an NMI occurs, all registered functions are unconditionally called.
If any function accepts the NMI, the remaining functions are still called but
the NMI is considered to have been valid (handled) & the "dazed" message
is suppressed.
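
To make the proposal concrete, here is a rough user-space model of the
dispatch, not a patch. All names (nmi_register, nmi_dispatch,
pending_flag_handler) are made up for illustration and are not existing
kernel interfaces; a real version would need NMI-safe handlers and proper
protection of the list.

```c
/* User-space model only: every name here is hypothetical, not a kernel API. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NMI_MAX_HANDLERS 8

/* A handler returns true if it found (and serviced) its own NMI source. */
typedef bool (*nmi_handler_fn)(void *data);

struct nmi_handler {
	nmi_handler_fn fn;
	void *data;
};

static struct nmi_handler nmi_handlers[NMI_MAX_HANDLERS];
static size_t nmi_handler_count;

/* Register a handler; returns 0 on success, -1 if the table is full. */
static int nmi_register(nmi_handler_fn fn, void *data)
{
	if (nmi_handler_count >= NMI_MAX_HANDLERS)
		return -1;
	nmi_handlers[nmi_handler_count].fn = fn;
	nmi_handlers[nmi_handler_count].data = data;
	nmi_handler_count++;
	return 0;
}

/*
 * Call every registered handler unconditionally, even after one of them
 * has accepted the NMI. Returns how many handlers accepted it; the
 * "dazed" message is printed only when nobody did.
 */
static unsigned int nmi_dispatch(void)
{
	unsigned int handled = 0;
	size_t i;

	for (i = 0; i < nmi_handler_count; i++)
		if (nmi_handlers[i].fn(nmi_handlers[i].data))
			handled++;

	if (!handled)
		fprintf(stderr, "Dazed and confused, but trying to continue\n");
	return handled;
}

/* Example handler: the bool it points at models a per-source status bit. */
static bool pending_flag_handler(void *data)
{
	bool *pending = data;
	bool was_pending = *pending;

	*pending = false;	/* acknowledge the source */
	return was_pending;
}
```

With two sources pending at once (say the BMC and the PMU), nmi_dispatch()
services both in a single pass and returns 2, so neither handler can swallow
the other's NMI regardless of registration order.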

This is more-or-less functionally equivalent to the last patch I posted but
may be cleaner. At a minimum, it is easier to understand the interactions
between the various handlers.

> Second, the "dazed" messages are being seen on other machines (currently
> core2quads) when using perf with lots of NMI events. So you might be
> seeing a second, more common issue there. I still need to find time to
> debug that.
>
> Finally, I am still scratching my head about the 'perf top' no longer
> hangs part. The only thing I can think of is under high perf load
> (without extra NMIs by your BMC), we have seen extra NMIs get generated while
> processing the current NMI (mainly because Nehalems have I think 4 or 8
> PMUs that can be active at once, so multiple NMIs can trigger here).
> But we can recover from this because we check _all_ the PMIs during the
> NMI (which currently always comes from the PMU).
>
> Now this extra NMI from the PMU can also happen on a singly activated
> PMU because we reload the PMU, then check the events to see if we should
> disable it. By the time we finish checking (and determine we are not done
> yet), the event could have rolled over and generated another NMI before we
> have finished processing the current one.
>
> So throw in an external NMI into the above situation (which gets dropped
> as the third NMI I believe if I read the history of these NMI things
> correctly), then it is possible that if uv_handle_nmi is called first it
> could swallow the extra NMI as its own and leave the hw_perf hanging.
> (that's a mouthful, huh?)
>
> Then again with the priorities switched I guess the opposite is true too,
> that your BMC is left missing an event.
>
> This sort of supports the need for your patch earlier or something similar
> which says ignore the handler's return code and process all the events on
> the die_chain anyway. And if no one has handled the NMI, then trigger an
> unknown NMI.
>
> Unless there is a way to determine if an NMI is latched or not before
> issuing the iret and if so assume we dropped an NMI and process everyone.
>
> I'll need to think of a way to prove all this in the morning (or maybe
> later).
>
> I hope that makes some sense as it is late and my brain is shutting down.
>
> Cheers,
> Don