Generic callfunction IPI problems

From: Jeremy Fitzhardinge
Date: Sun Jul 06 2008 - 10:50:51 EST


Hi Jens,

I'm seeing these oopses when running under Xen:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
PGD 0
Oops: 0000 [1] SMP
CPU 15
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip #306
RIP: e030:[<ffffffff8105de9a>] [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
RSP: e02b:ffff88007f653e98 EFLAGS: 00010046
RAX: ffffffff815fe6e0 RBX: ffff88007e523cc8 RCX: 0000000000000001
RDX: ffffc10000200200 RSI: 0000000000000001 RDI: ffffffff81693240
RBP: ffff88007f653eb8 R08: ffff88007f653ec8 R09: 0002db11ddd83820
R10: ffff880000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000000f R15: 0000000000000040
FS: 00007f1dadd907a0(0000) GS:ffff88007ff30080(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
Process swapper (pid: 0, threadinfo ffff88007ff90000, task ffff88007ff57100)
Stack: ffff88007ff2f4c0 0000000000000000 0000000000000000 000000000000004d
ffff88007f653ec8 ffffffff8100dea7 ffff88007f653ef8 ffffffff810747c5
ffffffff816959c0 000000000000004d ffff88007ff2f4c0 ffffffff81695a10
Call Trace:
<IRQ> [<ffffffff8100dea7>] xen_call_function_interrupt+0xe/0x167
[<ffffffff810747c5>] handle_IRQ_event+0x2e/0x65
[<ffffffff81075e5b>] handle_level_irq+0xb5/0x116
[<ffffffff81013f34>] do_IRQ+0xf7/0x177
[<ffffffff811ba227>] xen_evtchn_do_upcall+0xb3/0x136
[<ffffffff8141558e>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff810093aa>] ? _stext+0x3aa/0x1000
[<ffffffff810093aa>] ? _stext+0x3aa/0x1000
[<ffffffff8100a42e>] ? xen_safe_halt+0x10/0x1a
[<ffffffff8100ba26>] ? xen_idle+0x46/0x5c
[<ffffffff8100eb60>] ? cpu_idle+0xca/0x101
[<ffffffff8140bc7d>] ? cpu_bringup_and_idle+0x8a/0x8f


Code: e8 fc 96 fc ff 90 41 f6 44 24 20 01 74 08 41 83 64 24 20 fe eb 11 49 8d 7c 24 38 48 c7 c6 90 dd 05 81 e8 6f 92 01 00 4d 8b 24 24 <49> 8b 04 24 49 81 fc e0 e6 5f 81 0f 18 08 0f 85 11 ff ff ff 5b
RIP [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
RSP <ffff88007f653e98>
CR2: 0000000000000000
Kernel panic - not syncing: Fatal exception in interrupt


They're pretty rare: this 16-vcpu system got through a kernbench run with no problems, then oopsed this way when I left it idle overnight.

One interesting data point is that I've been experimenting with more virtualization-friendly spinlock algorithms. If I replace ticket locks with the old lock-byte algorithm, I see this oops much more frequently (and with a spin-and-block algorithm the system generally doesn't get through boot). I wonder if there's a race which is masked by ticket locks' strict FIFO ordering? (But this particular oops was with completely standard ticket locks in place.)
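
For anyone not familiar with the difference between the two lock types, here is a minimal sketch in portable C11 atomics. It is not the kernel's x86 implementation, and every name in it (byte_lock, ticket_lock, ticket_take and so on) is made up for illustration. The point is only that a byte lock goes to whichever CPU wins the exchange, while a ticket lock hands over in strict arrival order, which could plausibly hide an ordering-dependent race.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Byte lock (illustrative): a single flag; any waiter may win the next round. */
struct byte_lock {
	atomic_uchar locked;
};

/* Try once; returns true if this caller took the lock. */
static inline bool byte_trylock(struct byte_lock *l)
{
	return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

static inline void byte_lock_acquire(struct byte_lock *l)
{
	while (!byte_trylock(l))
		;	/* spin: no ordering among waiters */
}

static inline void byte_unlock(struct byte_lock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/* Ticket lock (illustrative): waiters are served in the order they arrived. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

/* Take a ticket; the returned number fixes this caller's place in the queue. */
static inline unsigned ticket_take(struct ticket_lock *l)
{
	return atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed);
}

/* Spin until our ticket comes up. */
static inline void ticket_wait(struct ticket_lock *l, unsigned ticket)
{
	while (atomic_load_explicit(&l->owner, memory_order_acquire) != ticket)
		;
}

/* Release: pass the lock to the holder of the next ticket. */
static inline void ticket_unlock(struct ticket_lock *l)
{
	atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Acquiring the ticket lock is ticket_wait(l, ticket_take(l)); I split it into two steps here only to make the queue position visible.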

I've been running your old generic IPI patches for a while with no problems; this seems to be specific to the version in tip.git. I haven't looked to see what differences there are yet.

I've also only observed problems under Xen, but I haven't done much testing on real hardware.

Thanks,
J