Re: nmi_watchdog=2 regression in 2.6.21

From: Björn Steinbrink
Date: Fri Aug 31 2007 - 14:06:40 EST


On 2007.08.31 09:35:23 -0700, Daniel Walker wrote:
> On Fri, 2007-08-31 at 09:21 -0700, Stephane Eranian wrote:
> > In this patch, the setup_*() routine now extract the MSR from the wd_ops
> > to copy them into the nmi_watchdog_ctlblk. This is not done for P4 because
> > of the special and ugly case of HT.
> >
> > With this approach, we can now create a custom wd_ops for CoreDuo that is
> > a clone of the intel_arch_wd_ops, except for the MSR.
> >
> > Could you try this one instead?
>
> So I tested your patch unchanged and the system boots, and the
> check_nmi_watchdog() passes .. However, the nmi stops ticking right
> after bootup,
>
> >>From my /proc/interrupts below,
>
> CPU0 CPU1 CPU2 CPU3
> 0: 108 0 0 0 IO-APIC-edge timer
> 1: 0 0 0 8 IO-APIC-edge i8042
> 4: 3427 0 0 1 IO-APIC-edge serial
> 8: 1 0 0 1 IO-APIC-edge rtc
> 12: 0 0 0 113 IO-APIC-edge i8042
> 14: 1128 0 0 10 IO-APIC-edge ide0
> 16: 1664 0 0 1 IO-APIC-fasteoi uhci_hcd:usb2, eth0
> 18: 0 0 0 0 IO-APIC-fasteoi ehci_hcd:usb1
> 19: 0 0 0 0 IO-APIC-fasteoi uhci_hcd:usb3
> 20: 0 0 0 1 IO-APIC-fasteoi acpi
> NMI: 1670 1453 1097 967
> LOC: 48001 48002 48000 48006
> ERR: 0
> MIS: 0
>
>
> The NMI field never changes ..
>
> So I added another change which looked appropriate,
>
> @@ -674,6 +688,7 @@ unsigned lapic_adjust_nmi_hz(unsigned hz
> {
> struct nmi_watchdog_ctlblk *wd = &__get_cpu_var(nmi_watchdog_ctlblk);
> if (wd->perfctr_msr == MSR_P6_PERFCTR0 ||
> + wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR0 ||
> wd->perfctr_msr == MSR_ARCH_PERFMON_PERFCTR1)
> hz = adjust_for_32bit_ctr(hz);
> return hz;
>
>
> Unfortunately that didn't fix anything, but I have a feeling is has

That's because MSR_P6_PERFCTR0 is the same as MSR_ARCH_PERFMON_PERFCTR0.
But I'd personally add that change anyway, as it makes the code less
tricky and the compiler should optimize it away.

> something to do with the nmi hertz adjustment that happens after
> check_nmi_watchdog() ..

Hm hm, does the same thing (watchdog stuck after check) happen with
older kernels, ie. those before Stephane's changeset that made it use
PERFCTR1?

Maybe you could "activate" the Dprintk in write_watchdog_counter32() to
see which value gets written to the MSR? (I don't see any switch to
activate it, so maybe just s/Dprintk(/printk(KERN_WHATEVER / ?)

Thanks,
Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/