RE: nmi_watchdog fix for x86_64 to be more like i386

From: Thomas Gleixner
Date: Fri Oct 05 2007 - 16:37:50 EST


On Thu, 4 Oct 2007, Pallipadi, Venkatesh wrote:
> >-----Original Message-----
> >From: linux-kernel-owner@xxxxxxxxxxxxxxx
> >[mailto:linux-kernel-owner@xxxxxxxxxxxxxxx] On Behalf Of
> >Thomas Gleixner
> >Sent: Monday, October 01, 2007 11:19 PM
> >To: Andi Kleen
> >Cc: Arjan van de Ven; David Bahi; LKML;
> >linux-rt-users@xxxxxxxxxxxxxxx; Andrew Morton; Ingo Molnar;
> >Gregory Haskins
> >Subject: Re: nmi_watchdog fix for x86_64 to be more like i386
> >
> >>
> >> The only workaround for chipsets ignoring IRQ affinity would
> >be to keep
> >> track on which CPU irq 0 happens and then restart APIC timer
> >interrupts
> >> on the others (or send IPIs) as needed. But that would be
> >fairly ugly.
> >
> >The clock events code does handle this already. The broadcast
> >interrupt
> >can come in on any cpu. It's just the nmi watchdog which would
> >be affected
> >by that.
> >
>
> Probably we can workaround this by keeping track of IRQ0 count at percpu
> level and
> use local apic timer + this percpu counter in NMI. Or just increment
> local
> apic timer count in IRQ0 with nohz enabled.

No, I tried that. It's ugly.

The per cpu accounting is the correct way to go if we want to take
care of those systems, which ignore the CPU0 binding of irq0.

See patch against the x86 tree below.

tglx

-------------------->
commit 093976c7ad206a008bd5de4619f40f6bca4a79c3
Author: Thomas Gleixner <tglx@xxxxxxxxxxxxxxxxxxxxxxxxxxx>
Date: Fri Oct 5 22:19:18 2007 +0200

x86: Fix irq0 / local apic timer accounting

The clock events merge introduced a change to the nmi watchdog code to
handle the not longer increasing local apic timer count in the
broadcast mode. This is fine for UP, but on SMP it pampers over a
stuck CPU which is not handling the broadcast interrupt due to the
unconditional sum up of local apic timer count and irq0 count.

To cover all cases we need to keep track on which CPU irq0 is
handled. In theory this is CPU#0 due to the explicit disabling of irq
balancing for irq0, but there are systems which ignore this on the
hardware level. The per cpu irq0 accounting allows us to remove the
irq0 to CPU0 binding as well.

Add a per cpu counter for irq0 and evaluate this instead of the global
irq0 count in the nmi watchdog code.

Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>

diff --git a/arch/x86/kernel/nmi_32.c b/arch/x86/kernel/nmi_32.c
index c7227e2..95d3fc2 100644
--- a/arch/x86/kernel/nmi_32.c
+++ b/arch/x86/kernel/nmi_32.c
@@ -353,7 +353,8 @@ __kprobes int nmi_watchdog_tick(struct pt_regs * regs, unsigned reason)
* Take the local apic timer and PIT/HPET into account. We don't
* know which one is active, when we have highres/dyntick on
*/
- sum = per_cpu(irq_stat, cpu).apic_timer_irqs + kstat_cpu(cpu).irqs[0];
+ sum = per_cpu(irq_stat, cpu).apic_timer_irqs +
+ per_cpu(irq_stat, cpu).irq0_irqs;

/* if the none of the timers isn't firing, this cpu isn't doing much */
if (!touched && last_irq_sums[cpu] == sum) {
diff --git a/arch/x86/kernel/time_32.c b/arch/x86/kernel/time_32.c
index 19a6c67..3571d0a 100644
--- a/arch/x86/kernel/time_32.c
+++ b/arch/x86/kernel/time_32.c
@@ -157,6 +157,9 @@ EXPORT_SYMBOL(profile_pc);
*/
irqreturn_t timer_interrupt(int irq, void *dev_id)
{
+ /* Keep nmi watchdog up to date */
+ per_cpu(irq_stat, cpu).irq0_irqs++;
+
#ifdef CONFIG_X86_IO_APIC
if (timer_ack) {
/*
diff --git a/include/asm-x86/hardirq_32.h b/include/asm-x86/hardirq_32.h
index ed7cf97..9188635 100644
--- a/include/asm-x86/hardirq_32.h
+++ b/include/asm-x86/hardirq_32.h
@@ -9,6 +9,7 @@ typedef struct {
unsigned long idle_timestamp;
unsigned int __nmi_count; /* arch dependent */
unsigned int apic_timer_irqs; /* arch dependent */
+ unsigned int irq0_irqs;
} ____cacheline_aligned irq_cpustat_t;

DECLARE_PER_CPU(irq_cpustat_t, irq_stat);

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/