Re: nmi_watchdog=2 regression in 2.6.21

From: Stephane Eranian
Date: Tue Aug 28 2007 - 15:47:17 EST


Daniel,

On Tue, Aug 28, 2007 at 11:30:35AM -0700, Daniel Walker wrote:
>
> Here is some output from my boot log,
>
> md: raid1 personality registered for level 1
> device-mapper: ioctl: 4.11.0-ioctl (2006-10-12) initialised: dm-devel@xxxxxxxxxx
> ip_tables: (C) 2000-2006 Netfilter Core Team
> TCP cubic registered
> NET: Registered protocol family 1
> NET: Registered protocol family 17
> Testing NMI watchdog ... CPU#0: NMI appears to be stuck (0->0)!
> CPU#1: NMI appears to be stuck (0->0)!
> CPU#2: NMI appears to be stuck (0->0)!
> CPU#3: NMI appears to be stuck (0->0)!
> Starting balanced_irq
> <hangs>
>
> So it does appear to make it out of the check_nmi_watchdog() function
> even tho there is a problem with the watchdog .. "Starting balance_irq"
> is in balanced_irq_init(). It usually hangs at this spot , but with
> initcall_debug on it once hung a little further down ..

I think I found the problem. As I suspected, it seems there is an assymetry
between the 1st end 2nd counter (just like what they have on P6 core). Yet
for architectural perfmon v1, this restriction is supposed to be lifted.

Unfortunately, a quick look at the errata document at
http://download.intel.com/design/mobile/SPECUPDT/30922212.pdf

for Core Duo shows bugs A49 as 'nofix':
Core Duo processor has a bug which renders the enable bit (22) of
PERFEVTSEL1 inoperative. The processor behaves like former P6 cores,
the enable bit of PERFEVTSEL0 controls the activation of both counters

That explains why you get the 'NMI stuck' message when using PERFEVTSEL1.
I suspect when using PERFEVTSEL0, then NMI watchdog is not stuck. So it
is possible that in case NMI is stuck the code does not cleanly shutdown the
NMI interrupt and you get some spurious NMI interrupt later in the boot at
you are stuck because you are holding a lock. We need to look at the error
path of check_nmi_watchdog(). I glance through it and could not find the place
where the APIC vector is cleared.

> It seems like there might be some other interrupt going off too early,
> that causes the system to hang..
>
> As you can see from the log above, this system is a quad.. Two physical
> cpus each with two cores.
>

--

-Stephane
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/