Re: WARNING: at kernel/workqueue.c:845

From: Tejun Heo
Date: Fri Aug 29 2014 - 08:37:45 EST


(cc'ing Lai, hi!)

There have been some changes in how workqueue handles CPU hotplug
recently. Maybe it's related? Lai, can you please take a look?
Christian also added that the problem can be reproduced on 3.16, though
less frequently.

Thanks.

On Fri, Aug 29, 2014 at 12:50:35PM +0200, Christian Borntraeger wrote:
> Tejun,
>
> with kvm/next (pretty close to 3.17-rc1) as KVM guest I get the following warning in one of my stress tests:
>
> [ 0.296047] ------------[ cut here ]------------
> [ 0.296050] WARNING: at kernel/workqueue.c:809
> [ 0.296051] Modules linked in:
> [ 0.296054] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.17.0-rc1+ #172
> [ 0.296056] task: 0000000000934618 ti: 000000000091c000 task.ti: 000000000091c000
> [ 0.296062] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
> Krnl GPRS: 0000000000000001 00000000063bed00 0000000000996740 0000000000000000
> [ 0.296065] 000000000000000d 000000000000036a ffffffff00000000 0000000000000000
> [ 0.296067] 000000000620e400 0000000000001201 0000000000001201 04000000009f6700
> [ 0.296068] 000000000639b700 0000000000000000 00000000001453e8 00000000063fbc80
> [ 0.296078] Krnl Code: 0000000000145416: 95002000 cli 0(%r2),0
> 000000000014541a: a774fff4 brc 7,145402
> #000000000014541e: a7f40001 brc 15,145420
> [ 0.296083] TCP: cubic registered
> [ 0.296084]
> >0000000000145422: 92012000 mvi 0(%r2),1
> [ 0.296086]
> 0000000000145426: a7f4ffee brc 15,145402
> 000000000014542a: 0707 bcr 0,%r7
> 000000000014542c: ebdff0800024 stmg %r13,%r15,128(%r15)
> [ 0.296092] Initializing XFRM netlink socket
> [ 0.296094]
> 0000000000145432: a7f13fe0 tmll %r15,16352
> [ 0.296096] Call Trace:
> [ 0.296099] ([<0000000000001201>] 0x1201)
> [ 0.296102] [<00000000001545dc>] ttwu_do_activate.constprop.97+0x64/0x7c
> [ 0.296103] [<00000000001550ce>] sched_ttwu_pending+0x7e/0xd4
> [ 0.296104] [<0000000000156bea>] scheduler_ipi+0x62/0x168
> [ 0.296107] [<0000000000113482>] smp_handle_ext_call+0xbe/0xdc
> [ 0.296111] [<000000000010b3dc>] do_ext_interrupt+0xb4/0xd4
> [ 0.296113] [<00000000001833be>] handle_irq_event_percpu+0x76/0x204
> [ 0.296115] [<00000000001870e8>] handle_percpu_irq+0x6c/0x98
> [ 0.296116] [<0000000000182a02>] generic_handle_irq+0x46/0x68
> [ 0.296117] [<000000000010b78a>] do_IRQ+0x5e/0x84
> [ 0.296120] [<0000000000634840>] ext_skip+0x42/0x46
> [ 0.296121] [<0000000000633fce>] vtime_stop_cpu+0x4a/0x9c
> [ 0.296122] ([<0000000000000000>] (null))
> [ 0.296124] [<0000000000103816>] arch_cpu_idle+0x92/0xa0
> [ 0.296126] [<000000000016bb9a>] cpu_startup_entry+0x15a/0x228
> [ 0.296127] [<00000000009a5ae4>] start_kernel+0x408/0x418
> [ 0.296128] [<0000000000100020>] _stext+0x20/0x80
> [ 0.296129] Last Breaking-Event-Address:
> [ 0.296130] [<000000000014541e>] wq_worker_waking_up+0x5e/0x6c
> [ 0.296133] ---[ end trace 68915e61d289d806 ]---
> [ 0.296137] reboot: Restarting system
> [ 0.296141] ------------[ cut here ]------------
>
> The test is basically to start 50 KVM guests that only have a kernel + busybox ramdisk with rcS calling reboot.
> One or two guests of these 50 have this warning pretty soon. With 3.16 as guest everything runs fine.
>
> Do you have any idea what might be wrong or do I need to bisect?
>
> Christian
>

--
tejun