Re: workqueue: WARN at at kernel/workqueue.c:2176

From: Jason J. Herne
Date: Thu May 29 2014 - 12:23:52 EST


On 05/27/2014 10:26 AM, Peter Zijlstra wrote:
On Tue, May 27, 2014 at 10:18:31AM -0400, Jason J. Herne wrote:
On 05/16/2014 12:29 PM, Peter Zijlstra wrote:
On Sat, May 17, 2014 at 12:18:06AM +0800, Lai Jiangshan wrote:
so the scheduler/set_cpus_allowed_ptr()/cpu_active_mask should be the first
place to fix.

I'm not arguing about that, not to mention that this is userspace
exposed and nobody protects that.

But I was expecting kernel stuff that calls it on hotplug to be
serialized thusly, but apparently not so.


Was a final patch posted for this issue? The discussion made it sound like
there were still a few things to figure out before we could resolve this
bug. I can recreate this as needed and I'm happy to test any patches.


https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=sched/urgent&id=6acbfb96976fc3350e30d964acb1dbbdf876d55e

which should make its way to Linus soonish I suppose.


I applied the patch on top of c7208164e66f63e3ec1759b98087849286410741 and I am still hitting the problem.
Should I have applied to a different branch/commit to pick up any other needed changes?

Patch applied:

diff --git a/kernel/cpu.c b/kernel/cpu.c
index a9e710e..247979a 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -726,10 +726,12 @@ void set_cpu_present(unsigned int cpu, bool present)
void set_cpu_online(unsigned int cpu, bool online)
{
- if (online)
+ if (online) {
cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
- else
+ cpumask_set_cpu(cpu, to_cpumask(cpu_active_bits));
+ } else {
cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
+ }
}
void set_cpu_active(unsigned int cpu, bool active)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 44e00ab..86f3890 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5076,7 +5076,6 @@ static int sched_cpu_active(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
switch (action & ~CPU_TASKS_FROZEN) {
- case CPU_STARTING:
case CPU_DOWN_FAILED:
set_cpu_active((long)hcpu, true);
return NOTIFY_OK;

Here is the output from the recreation using this patch:

[ 3634.146233] ------------[ cut here ]------------
[ 3634.146238] WARNING: at kernel/workqueue.c:2176
[ 3634.146239] Modules linked in: ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack xt_CHECKSUM iptable_mangle bridge stp llc ip6table_filter ip6_tables ebtable_nat ebtables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi qeth_l2 tape_3590 tape tape_class vhost_net tun vhost macvtap macvlan eadm_sch qeth ccwgroup zfcp scsi_transport_fc scsi_tgt qdio dasd_eckd_mod dasd_mod dm_multipath [last unloaded: kvm]
[ 3634.146260] CPU: 6 PID: 28009 Comm: kworker/7:0 Not tainted 3.15.0-rc7 #1
[ 3634.146263] Workqueue: \xffffff80 (null)
[ 3634.146264] task: 000000025def32e0 ti: 000000026dca0000 task.ti: 000000026dca0000
[ 3634.146266] Krnl PSW : 0404c00180000000 000000000015ad1a (process_one_work+0x2e6/0x4c0)
[ 3634.146272] R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 EA:3
Krnl GPRS: 0000000000000000 0000000000bc649a 00000002764b0980 0000000000b94f40
[ 3634.146275] 0000000000b94f40 0000000000000000 0000000000000000 0000000000bc6496
[ 3634.146277] 0000000000000000 000000008b65b600 000000008b657000 000000008b657018
[ 3634.146278] 00000002764b0980 0000000000b94f40 000000026dca3dd0 000000026dca3d70
[ 3634.146287] Krnl Code: 000000000015ad0e: 95001000 cli 0(%r1),0
000000000015ad12: a774fece brc 7,15aaae
#000000000015ad16: a7f40001 brc 15,15ad18
>000000000015ad1a: 92011000 mvi 0(%r1),1
000000000015ad1e: a7f4fec8 brc 15,15aaae
000000000015ad22: e31003180004 lg %r1,792
000000000015ad28: 58301024 l %r3,36(%r1)
000000000015ad2c: a73a0001 ahi %r3,1
[ 3634.146299] Call Trace:
[ 3634.146301] ([<000000000015ace8>] process_one_work+0x2b4/0x4c0)
[ 3634.146303] [<000000000015c100>] worker_thread+0x178/0x39c
[ 3634.146305] [<0000000000164ba6>] kthread+0x10e/0x128
[ 3634.146310] [<000000000072d026>] kernel_thread_starter+0x6/0xc
[ 3634.146312] [<000000000072d020>] kernel_thread_starter+0x0/0xc
[ 3634.146313] Last Breaking-Event-Address:
[ 3634.146315] [<000000000015ad16>] process_one_work+0x2e2/0x4c0
[ 3634.146316] ---[ end trace 03f51c9126c24171 ]---

I don't think this output provides anything new. Please let me know if I can gather any more data.

--
-- Jason J. Herne (jjherne@xxxxxxxxxxxxxxxxxx)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/