Re: v2.6.26-rc9: kernel BUG at kernel/sched.c:5858!

From: Vegard Nossum
Date: Fri Jul 11 2008 - 14:00:27 EST


On Fri, Jul 11, 2008 at 7:51 PM, Vegard Nossum <vegard.nossum@xxxxxxxxx> wrote:
> Now I tested using serial console, but nothing new:
>
> CPU0 attaching NULL sched-domain.
> CPU1 attaching NULL sched-domain.
> CPU0 attaching sched-domain:
> domain 0: span 0-1
> groups: 0 1
> domain 1: span 0-1
> groups: 0-1
> CPU1 attaching sched-domain:
> domain 0: span 0-1
> groups: 1 0
> domain 1: span 0-1
> groups: 0-1
> * [ 7 ] comm (ksoftirqd/1), orig_cpu (1), dst_cpu (1), cpu (1)
> CPU 1 is now offline
> * [ 1228 ] comm (kjournald), orig_cpu (0), dst_cpu (0), cpu (0)
> * [ 3113 ] comm (klogd), orig_cpu (0), dst_cpu (0), cpu (0)
> BUG: spinlock recursion on CPU#0, syslogd/3110
>
> and here the output stops. I find this REALLY strange, look at the
> spinlock recursion code:
>
> printk(KERN_EMERG "BUG: spinlock %s on CPU#%d, %s/%d\n",
> msg, raw_smp_processor_id(),
> current->comm, task_pid_nr(current));
> printk(KERN_EMERG " lock: %p, .magic: %08x, .owner: %s/%d, "
> ".owner_cpu: %d\n",
> lock, lock->magic,
> owner ? owner->comm : "<none>",
> owner ? task_pid_nr(owner) : -1,
> lock->owner_cpu);
>
> why would it not be able to print the second line?

Enabling NMI watchdog gives us this additional info (whoa, what a huge
stacktrace):

BUG: spinlock recursion on CPU#0, bash/3832
BUG: NMI Watchdog detected LOCKUP on CPU0, ip c010a5a1, registers:
Pid: 3832, comm: bash Not tainted (2.6.26-rc9-00103-g2702484 #18)
EIP: 0060:[<c010a5a1>] EFLAGS: 00200082 CPU: 0
EIP is at native_read_tsc+0x11/0x20
EAX: 4633e3b2 EBX: c06c6d00 ECX: 00000001 EDX: 00000086
ESI: 00000001 EDI: 00000000 EBP: f6307a68 ESP: f6307a68
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3832, ti=f6306000 task=f5b71fe0 task.ti=f6306000)
Stack: f6307a80 c027d6b2 00000000 c06c6d00 00360dc6 00000000 f6307a88 c027d5c9
f6307ac0 c028b8ea c0148e0a c06da1a0 c1fd8180 c06da1a0 f6307aac 00000001
b25fadf4 00000000 b25fadf4 c06c6d10 c06c6d00 00200092 f6307ae0 c0572cb3
Call Trace:
[<c027d6b2>] ? delay_tsc+0x22/0xa4
[<c027d5c9>] ? __delay+0x9/0x10
[<c028b8ea>] ? _raw_spin_lock+0xea/0x180
[<c0148e0a>] ? atomic_notifier_call_chain+0x1a/0x20
[<c0572cb3>] ? _spin_lock_irqsave+0x63/0x80
[<c01236fb>] ? __wake_up+0x1b/0x50
[<c01236fb>] ? __wake_up+0x1b/0x50
[<c0130885>] ? release_console_sem+0x1b5/0x1e0
[<c01306cb>] ? wake_up_klogd+0x3b/0x40
[<c0130899>] ? release_console_sem+0x1c9/0x1e0
[<c0130d5c>] ? vprintk+0x29c/0x410
[<c01187e0>] ? nmi_watchdog_tick+0x90/0x160
[<c01063e5>] ? do_nmi+0xb5/0x2c0
[<c05734b3>] ? nmi_stack_correct+0x26/0x2b
[<c0130eeb>] ? printk+0x1b/0x20
[<c028b623>] ? spin_bug+0x63/0x100
[<c028b96c>] ? _raw_spin_lock+0x16c/0x180
[<c0148e0a>] ? atomic_notifier_call_chain+0x1a/0x20
[<c01236fb>] ? __wake_up+0x1b/0x50
[<c0572cb3>] ? _spin_lock_irqsave+0x63/0x80
[<c01236fb>] ? __wake_up+0x1b/0x50
[<c01236fb>] ? __wake_up+0x1b/0x50
[<c0130885>] ? release_console_sem+0x1b5/0x1e0
[<c01306cb>] ? wake_up_klogd+0x3b/0x40
[<c0130899>] ? release_console_sem+0x1c9/0x1e0
[<c0130d5c>] ? vprintk+0x29c/0x410
[<c0123c20>] ? enqueue_entity+0x70/0x210
[<c0124dc0>] ? enqueue_task_fair+0x40/0x50
[<c0124830>] ? try_to_wake_up+0x120/0x2b0
[<c0130eeb>] ? printk+0x1b/0x20
[<c0124877>] ? try_to_wake_up+0x167/0x2b0
[<c01249cb>] ? default_wake_function+0xb/0x10
[<c0143fdb>] ? autoremove_wake_function+0x1b/0x50
[<c01214d8>] ? __wake_up_common+0x48/0x70
[<c0123717>] ? __wake_up+0x37/0x50
[<c01306cb>] ? wake_up_klogd+0x3b/0x40
[<c0130899>] ? release_console_sem+0x1c9/0x1e0
[<c0130d5c>] ? vprintk+0x29c/0x410
[<c0130dc9>] ? vprintk+0x309/0x410
[<c0152bc5>] ? mark_held_locks+0x65/0x80
[<c05710e5>] ? __mutex_unlock_slowpath+0xb5/0x150
[<c0130eeb>] ? printk+0x1b/0x20
[<c0109998>] ? alternatives_smp_switch+0x18/0x1b0
[<c0130eeb>] ? printk+0x1b/0x20
[<c01170a4>] ? __cpu_die+0x74/0x80
[<c054f75a>] ? _cpu_down+0x13a/0x240
[<c0158baf>] ? cpu_maps_update_begin+0xf/0x20
[<c054f88b>] ? cpu_down+0x2b/0x40
[<c0550aa9>] ? store_online+0x39/0x80
[<c0550a70>] ? store_online+0x0/0x80
[<c02f1e7b>] ? sysdev_store+0x2b/0x40
[<c01d56f2>] ? sysfs_write_file+0xa2/0x100
[<c01972e6>] ? vfs_write+0x96/0x130
[<c01d5650>] ? sysfs_write_file+0x0/0x100
[<c01979ad>] ? sys_write+0x3d/0x70
[<c01042a1>] ? sysenter_past_esp+0x6a/0xb1
=======================
Code: 56 30 17 00 5d c3 8d 74 26 00 e6 ed 5d c3 90 90 90 90 90 90 90 90 90 90 90
90 55 89 e5 8d 76 00 0f ae e8 0f 31 8d 76 00 0f ae e8 <5d> c3 8d b6 00 00 00 00
8d bc 27 00 00 00 00 55 89 e5 57 89 c7

I don't know if it helps too much, but I think it can't really be
anything but the printk()s from sched code...


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/