Re: Commit 7eaceaccab5f40 causing boot hang.

From: Tejun Heo
Date: Thu Mar 31 2011 - 08:11:12 EST


Hello,

On Thu, Mar 31, 2011 at 12:37:42PM +0200, Jens Axboe wrote:
> It seems to hard hang, looks very odd:
>
> [ 84.056007] BUG: soft lockup - CPU#0 stuck for 67s! [kworker/0:2:743]
> [ 84.056008] Modules linked in:
> [ 84.056008] irq event stamp: 334859658
> [ 84.056008] hardirqs last enabled at (334859657): [<ffffffff815c40c7>] _raw_spin_unlock_irq+0x2b/0x30
> [ 84.056008] hardirqs last disabled at (334859658): [<ffffffff815c42e7>] save_args+0x67/0x70
> [ 84.056008] softirqs last enabled at (334855538): [<ffffffff81044819>] __do_softirq+0x1a3/0x1c2
> [ 84.056008] softirqs last disabled at (334855525): [<ffffffff815cb9cc>] call_softirq+0x1c/0x30
> [ 84.056008] CPU 0
> [ 84.056008] Modules linked in:
> [ 84.056008]
> [ 84.056008] Pid: 743, comm: kworker/0:2 Not tainted 2.6.39-rc1+ #12 Bochs Bochs
> [ 84.056008] RIP: 0010:[<ffffffff815c40c9>] [<ffffffff815c40c9>] _raw_spin_unlock_irq+0x2d/0x30
> [ 84.056008] RSP: 0018:ffff88003d343d98 EFLAGS: 00000202
> [ 84.056008] RAX: 0000000013f58d89 RBX: 0000000000000006 RCX: ffff88003d2c5998
> [ 84.056008] RDX: 0000000000000006 RSI: ffff88003d343da0 RDI: ffff88003db19508
> [ 84.056008] RBP: ffff88003d343da0 R08: ffff88003fc15c00 R09: 0000000000000001
> [ 84.056008] R10: ffffffff81e0d040 R11: ffff88003d343d60 R12: ffffffff815cb18e
> [ 84.056008] R13: 0000000000000001 R14: ffff88003d2c5998 R15: ffffffff81069aec
> [ 84.056008] FS: 0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
> [ 84.056008] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 84.056008] CR2: 000000000060d828 CR3: 000000003d3f8000 CR4: 00000000000006f0
> [ 84.056008] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 84.056008] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 84.056008] Process kworker/0:2 (pid: 743, threadinfo ffff88003d342000, task ffff88003db18f60)
> [ 84.056008] Stack:
> [ 84.056008] ffff88003d2c5870 ffff88003d343dc0 ffffffff812171d3 ffff88003fc15c00
> [ 84.056008] ffff88003d31e6c0 ffff88003d343e50 ffffffff81053e99 ffffffff81053e0b
> [ 84.056008] ffff88003d342010 ffff88003db18f60 0000000000000046 ffff88003fc15c05
> [ 84.056008] Call Trace:
> [ 84.056008] [<ffffffff812171d3>] blk_delay_work+0x32/0x36
> [ 84.056008] [<ffffffff81053e99>] process_one_work+0x230/0x397
> [ 84.056008] [<ffffffff81053e0b>] ? process_one_work+0x1a2/0x397
> [ 84.056008] [<ffffffff8105612a>] worker_thread+0x136/0x255
> [ 84.056008] [<ffffffff81055ff4>] ? manage_workers+0x190/0x190
> [ 84.056008] [<ffffffff8105974a>] kthread+0x7d/0x85
> [ 84.056008] [<ffffffff815cb8d4>] kernel_thread_helper+0x4/0x10
> [ 84.056008] [<ffffffff815c4440>] ? retint_restore_args+0xe/0xe
> [ 84.056008] [<ffffffff810596cd>] ? __init_kthread_worker+0x56/0x56
> [ 84.056008] [<ffffffff815cb8d0>] ? gs_change+0xb/0xb
> [ 84.056008] Code: 01 00 00 00 48 89 e5 53 48 89 fb 48 83 c7 18 48 83 ec 08 48 8b 55 08 e8 11 7b aa ff 48 89 df e8 03 05 c7 ff e8 f3 5e aa ff fb 5e <5b> c9 c3 55 48 89 e5 41 54 49 89 fc 48 8b 55 08 48 83 c7 18 53

Is CONFIG_PREEMPT[_VOLUNTARY] set? Soft lockup detection works by
checking whether a high-priority RT task gets scheduled periodically;
busy looping in kernel code with preemption disabled (or on a kernel
without CONFIG_PREEMPT) will trigger it. The backtrace doesn't mean
the CPU is stuck there making no progress. It just shows where the
CPU happened to be when the detector fired, and the fact that the
softlockup fired at all indicates that IRQs and bottom halves are
still running fine.
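
For illustration, here's a minimal sketch of the kind of loop that
trips the detector. It's hypothetical code with made-up names, not
anything from the block layer:

	#include <linux/preempt.h>
	#include <linux/compiler.h>	/* ACCESS_ONCE() */
	#include <asm/processor.h>	/* cpu_relax() */

	static int done;	/* hypothetical flag someone else sets */

	static void buggy_wait(void)
	{
		/* IRQs still fire, so the watchdog can detect this, but
		 * with preemption off nothing else runs on this CPU and
		 * whoever is supposed to set @done never gets a chance. */
		preempt_disable();
		while (!ACCESS_ONCE(done))
			cpu_relax();
		preempt_enable();
	}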

I think the more likely failure mode is that something in the block
path is busy looping, expecting someone else to make progress, but
because it keeps spinning and preemption isn't available, that someone
else never gets to run. We had a similar problem in md recently.

http://thread.gmane.org/gmane.linux.raid/33020
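
If it's the same pattern, the usual fix is to yield explicitly instead
of spinning. A sketch, reusing the made-up flag from above:

	#include <linux/sched.h>	/* cond_resched() */

	static void fixed_wait(void)
	{
		/* cond_resched() gives the other party (e.g. a kworker)
		 * a chance to run even when CONFIG_PREEMPT is not set */
		while (!ACCESS_ONCE(done))
			cond_resched();
	}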

If this is the case, a good way to debug it would be to trigger
sysrq-l multiple times while the machine is hung and see where the
CPU is busy looping.
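
(sysrq-l dumps backtraces of all active CPUs, and since interrupts are
clearly still being serviced it should work from the keyboard even
during the hang; just make sure sysrq is enabled beforehand, e.g. with
echo 1 > /proc/sys/kernel/sysrq.)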

Thanks.

--
tejun