Re: smp_call_function_single lockups

From: Chris J Arges
Date: Mon Mar 30 2015 - 23:16:24 EST


On Thu, Feb 19, 2015 at 02:45:54PM -0800, Linus Torvalds wrote:
> On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > Is this worth looking at? Or is it something spurious? I might have
> > gotten the vectors wrong, and maybe the warning is not because the ISR
> > bit isn't set, but because I test the wrong bit.
>
> I edited the patch to do ratelimiting (one per 10s max) rather than
> "once". And tested it some more. It seems to work correctly. The irq
> case during 8042 probing is not repeatable, and I suspect it happens
> because the interrupt source goes away (some probe-time thing that
> first triggers an interrupt, but then clears it itself), so it doesn't
> happen every boot, and I've gotten it with slightly different
> backtraces.
>
> But it's the only warning that happens for me, so I think my code is
> right (at least for the cases that trigger on this machine). It's
> definitely not a "every interrupt causes the warning because the code
> was buggy, and the WARN_ONCE() just printed the first one".
>
> It would be interesting to hear if others see spurious APIC EOI cases
> too. In particular, the people seeing the IPI lockup. Because a lot of
> the lockups we've seen have *looked* like the IPI interrupt just never
> happened, and so we're waiting forever for the target CPU to react to
> it. And just maybe the spurious EOI could cause the wrong bit to be
> cleared in the ISR, and then the interrupt never shows up. Something
> like that would certainly explain why it only happens on some machines
> and under certain timing circumstances.
>
> Linus
>

Linus,

I'm able to reproduce this IPI lockup easily now when using specific hardware
and nested KVM VMs. However, it seems to only occur when using certain host
hardware. For example:
- Xeon E5620 (Westmere-EP) (fam: 06, model: 2c)
- Xeon E312xx (Sandy Bridge) (fam: 06, model: 2a)
Now I'm not sure if this indicates hardware LAPIC issues or if the timing in
these processors make it more likely to hit this issue. So far I haven't seen
the issue on other CPUs.

To set this up, I've done the following (L0 being the Xeon box):
1) Create an L1 KVM VM with 2 vCPUs (single vCPU case doesn't reproduce)
2) Create an L2 KVM VM inside the L1 VM with 1 vCPU
3) Run something like 'stress -c 1 -m 1 -d 1 -t 1200' inside the L2 VM

Sometimes this is sufficient to reproduce the issue, I've observed that running
KSM in the L1 VM can agitate this issue (it calls native_flush_tlb_others).
If this doesn't reproduce then you can do the following:
4) Migrate the L2 vCPU randomly (via virsh vcpupin --live OR tasksel) between
L1 vCPUs until the hang occurs.

Pinning L1 vCPUs to L0 initially (i.e. 0->0 1->1) causes the hangs to not occur
in my testing. (Which makes sense because we're unlikely to do real IPIs.)

I've been able to repro with your patch and observed the WARN_ON when booting a
VM on affected hardware and non affected hardware:

[ 13.613531] ------------[ cut here ]------------
[ 13.613531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x84/0x90()
[ 13.613531] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKSUM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) serio_raw(E) i2c_piix4(E) parport_pc(E) pvpanic(E) parport(E) mac_hid(E) nls_utf8(E) isofs(E) psmouse(E) floppy(E) cirrus(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[ 13.613531] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G E 4.0.0-rc6-460f8calinus1 #4
[ 13.613531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 13.613531] ffffffff81a911a6 ffff88013fc03eb8 ffffffff817999a7 0000000000000007
[ 13.613531] 0000000000000000 ffff88013fc03ef8 ffffffff810734ca 0000000000000092
[ 13.613531] 0000000000000011 ffff8800b8bbac00 000000000000001f 00000000000000b1
[ 13.613531] Call Trace:
[ 13.613531] <IRQ> [<ffffffff817999a7>] dump_stack+0x45/0x57
[ 13.613531] [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[ 13.613531] [<ffffffff810735ba>] warn_slowpath_null+0x1a/0x20
[ 13.613531] [<ffffffff8104d3f4>] apic_ack_edge+0x84/0x90
[ 13.613531] [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[ 13.613531] [<ffffffff81016aa2>] handle_irq+0x22/0x40
[ 13.613531] [<ffffffff817a3b9f>] do_IRQ+0x4f/0x140
[ 13.613531] [<ffffffff817a196d>] common_interrupt+0x6d/0x6d
[ 13.613531] <EOI> [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[ 13.613531] [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[ 13.613531] [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[ 13.613531] [<ffffffff8101ecde>] default_idle+0x1e/0xc0
[ 13.613531] [<ffffffff8101f6cf>] arch_cpu_idle+0xf/0x20
[ 13.613531] [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[ 13.613531] [<ffffffff8178e9b7>] rest_init+0x77/0x80
[ 13.613531] [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[ 13.613531] [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[ 13.613531] [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[ 13.613531] [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[ 13.613531] ---[ end trace 4512c19aad733ce8 ]---

I modified the posted patch with the following:
diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index bf32309..dc3e192 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -441,7 +441,7 @@ static inline void ack_APIC_irq(int vector)
if (vector >= 16) {
unsigned v = apic_read(APIC_ISR + ((vector & ~0x1f) >> 1));
v >>= vector & 0x1f;
- WARN_ON_ONCE(!(v & 1));
+ WARN(!(v & 1), "ack_APIC_irq: vector = %0x\n", vector);
}
/*
* ack_APIC_irq() actually gets compiled as a single instruction

And it showed vector = 1b when booting. However, when I run the reproducer on
an affected machine I get the following WARNs before the hang:

[ 36.301282] ------------[ cut here ]------------
[ 36.301299] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
[ 36.301301] ack_APIC_irq: vector = e1
[ 36.301303] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKS
UM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) parport_pc(E) pvpanic(E)
parport(E) serio_raw(E) mac_hid(E) i2c_piix4(E) nls_utf8(E) isofs(E) cirrus(E) psmouse(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) floppy(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[ 36.301344] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W E 4.0.0-rc6-655d7f2linus2 #5
[ 36.301346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 36.301348] ffffffff81a910f2 ffff88013fc03e68 ffffffff817998e7 0000000000000007
[ 36.301352] ffff88013fc03eb8 ffff88013fc03ea8 ffffffff810734ca ffffffff81c0c0f8
[ 36.301355] 0000000000000001 00000000000000e1 0000000000000019 0000000000000051
[ 36.301359] Call Trace:
[ 36.301361] <IRQ> [<ffffffff817998e7>] dump_stack+0x45/0x57
[ 36.301373] [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[ 36.301377] [<ffffffff81073546>] warn_slowpath_fmt+0x46/0x50
[ 36.301383] [<ffffffff81405420>] ? pci_msi_unmask_irq+0x10/0x20
[ 36.301387] [<ffffffff8104d3b3>] apic_ack_edge+0x93/0xa0
[ 36.301392] [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[ 36.301399] [<ffffffff81016aa2>] handle_irq+0x22/0x40
[ 36.301403] [<ffffffff817a3adf>] do_IRQ+0x4f/0x140
[ 36.301407] [<ffffffff817a18ad>] common_interrupt+0x6d/0x6d
[ 36.301409] <EOI> [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[ 36.301417] [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[ 36.301422] [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[ 36.301426] [<ffffffff8101ecce>] default_idle+0x1e/0xc0
[ 36.301430] [<ffffffff8101f6bf>] arch_cpu_idle+0xf/0x20
[ 36.301433] [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[ 36.301438] [<ffffffff8178e8f7>] rest_init+0x77/0x80
[ 36.301443] [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[ 36.301446] [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[ 36.301450] [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[ 36.301453] [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[ 36.301455] ---[ end trace 8f10e066e6d6eddf ]---
[ 40.430515] ------------[ cut here ]------------
[ 40.430531] WARNING: CPU: 0 PID: 0 at ./arch/x86/include/asm/apic.h:444 apic_ack_edge+0x93/0xa0()
[ 40.430533] ack_APIC_irq: vector = 22
[ 40.430535] Modules linked in: ipt_MASQUERADE(E) nf_nat_masquerade_ipv4(E) iptable_nat(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack_ipv4(E) nf_defrag_ipv4(E) xt_conntrack(E) nf_conntrack(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_CHECKS
UM(E) iptable_mangle(E) xt_tcpudp(E) bridge(E) stp(E) llc(E) ip6table_filter(E) ip6_tables(E) iptable_filter(E) ip_tables(E) ebtable_nat(E) ebtables(E) x_tables(E) dm_crypt(E) ppdev(E) kvm_intel(E) kvm(E) parport_pc(E) pvpanic(E)
parport(E) serio_raw(E) mac_hid(E) i2c_piix4(E) nls_utf8(E) isofs(E) cirrus(E) psmouse(E) syscopyarea(E) sysfillrect(E) sysimgblt(E) ttm(E) floppy(E) drm_kms_helper(E) drm(E) pata_acpi(E)
[ 40.430576] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W E 4.0.0-rc6-655d7f2linus2 #5
[ 40.430578] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 40.430581] ffffffff81a910f2 ffff88013fc03e68 ffffffff817998e7 0000000000000007
[ 40.430584] ffff88013fc03eb8 ffff88013fc03ea8 ffffffff810734ca 0000000000000202
[ 40.430587] 0000000000000002 0000000000000022 000000000000001d 0000000000000091
[ 40.430591] Call Trace:
[ 40.430593] <IRQ> [<ffffffff817998e7>] dump_stack+0x45/0x57
[ 40.430605] [<ffffffff810734ca>] warn_slowpath_common+0x8a/0xc0
[ 40.430609] [<ffffffff81073546>] warn_slowpath_fmt+0x46/0x50
[ 40.430615] [<ffffffff81405420>] ? pci_msi_unmask_irq+0x10/0x20
[ 40.430631] [<ffffffff8104d3b3>] apic_ack_edge+0x93/0xa0
[ 40.430637] [<ffffffff810cf8e7>] handle_edge_irq+0x57/0x120
[ 40.430643] [<ffffffff81016aa2>] handle_irq+0x22/0x40
[ 40.430647] [<ffffffff817a3adf>] do_IRQ+0x4f/0x140
[ 40.430651] [<ffffffff817a18ad>] common_interrupt+0x6d/0x6d
[ 40.430653] <EOI> [<ffffffff810def08>] ? hrtimer_start+0x18/0x20
[ 40.430661] [<ffffffff8105a356>] ? native_safe_halt+0x6/0x10
[ 40.430666] [<ffffffff810d5623>] ? rcu_eqs_enter+0xa3/0xb0
[ 40.430670] [<ffffffff8101ecce>] default_idle+0x1e/0xc0
[ 40.430674] [<ffffffff8101f6bf>] arch_cpu_idle+0xf/0x20
[ 40.430678] [<ffffffff810b5d6f>] cpu_startup_entry+0x2ff/0x420
[ 40.430684] [<ffffffff8178e8f7>] rest_init+0x77/0x80
[ 40.430689] [<ffffffff81d50f97>] start_kernel+0x43c/0x449
[ 40.430693] [<ffffffff81d50120>] ? early_idt_handlers+0x120/0x120
[ 40.430696] [<ffffffff81d504d7>] x86_64_start_reservations+0x2a/0x2c
[ 40.430699] [<ffffffff81d5062b>] x86_64_start_kernel+0x152/0x161
[ 40.430702] ---[ end trace 8f10e066e6d6ede0 ]---

So vector = e1 then 22 before the hang.

[ 45.728212] kvm [1602]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
[ 53.391974] random: nonblocking pool is initialized
[ 80.076005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1602]

Some things I've already done are:
- run with nox2apic on L0,L1 (so apic->name == "flat"), this still reproduces
- run with apic=off on L0,L1, this still reproduces

Anyway, maybe this sheds some more light on this issue. I can reproduce this at
will, so let me know of other experiments to do.

Thanks,
--chris j arges

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/