Re: [PATCH 3/4] sched: WARN when migrating to an offline CPU

From: Levin, Alexander (Sasha Levin)
Date: Fri Oct 06 2017 - 22:08:13 EST


On Fri, Sep 29, 2017 at 01:11:26PM +0200, Peter Zijlstra wrote:
>I can't seem to trigger :-(
>
>Can you please run with the below patch and:
>
> # echo 1 > /proc/sys/kernel/traceoff_on_warning

The call stack trace looks like so:

[ 2073.492089] Unregister pv shared memory for cpu 2
[ 2073.495414] NOHZ: local_softirq_pending 202
[ 2073.516166] ------------[ cut here ]------------
[ 2073.519047] WARNING: CPU: 2 PID: 24 at kernel/sched/core.c:1178 set_task_cpu (kernel/sched/core.c:1157)
[ 2073.520317] Modules linked in:
[ 2073.520816] CPU: 2 PID: 24 Comm: migration/2 Not tainted 4.14.0-rc2-next-20170927+ #256
[ 2073.522046] task: ffff8801ab7cc040 task.stack: ffff8801ab7d8000
[ 2073.522963] RIP: 0010:set_task_cpu (??:?)
[ 2073.523655] RSP: 0018:ffff8801ab807b20 EFLAGS: 00010096
[ 2073.524662] RAX: 0000000000000002 RBX: ffff880189810040 RCX: 1ffff100356f993a
[ 2073.525747] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000046
[ 2073.526845] RBP: ffff8801ab807b60 R08: 0000000000000000 R09: 0000000000000000
[ 2073.527940] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000001e5900
[ 2073.529027] R13: 0000000000000006 R14: 0000000000000002 R15: ffff8801898100a4
[ 2073.530118] FS: 0000000000000000(0000) GS:ffff8801ab800000(0000) knlGS:0000000000000000
[ 2073.531556] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2073.532443] CR2: 00007ffd965f1be8 CR3: 0000000187ede001 CR4: 00000000000606a0
[ 2073.533546] Call Trace:
[ 2073.533944] <IRQ>
[ 2073.534280] detach_task.isra.80 (kernel/sched/fair.c:6816)
[ 2073.534939] load_balance (./include/linux/list.h:78 kernel/sched/fair.c:6902 kernel/sched/fair.c:8204)
[ 2073.535543] ? find_busiest_group (kernel/sched/fair.c:8133)
[ 2073.536252] rebalance_domains (kernel/sched/fair.c:8838)
[ 2073.536903] ? pick_next_task_fair (kernel/sched/fair.c:8787)
[ 2073.537628] ? check_preemption_disabled (lib/smp_processor_id.c:52)
[ 2073.538395] run_rebalance_domains (kernel/sched/fair.c:9062)
[ 2073.539096] ? check_preemption_disabled (lib/smp_processor_id.c:52)
[ 2073.539850] __do_softirq (kernel/softirq.c:284 ./include/linux/jump_label.h:141 ./include/trace/events/irq.h:141 kernel/softirq.c:285)
[ 2073.540424] ? check_preemption_disabled (lib/smp_processor_id.c:52)
[ 2073.541189] irq_exit (kernel/softirq.c:364 kernel/softirq.c:405)
[ 2073.541715] smp_apic_timer_interrupt (./arch/x86/include/asm/irq_regs.h:26 arch/x86/kernel/apic/apic.c:1043)
[ 2073.542452] apic_timer_interrupt (arch/x86/entry/entry_64.S:770)
[ 2073.543116] </IRQ>
[ 2073.543463] RIP: 0010:multi_cpu_stop (??:?)
[ 2073.544173] RSP: 0018:ffff8801ab7dfd88 EFLAGS: 00000292 ORIG_RAX: ffffffffffffff11
[ 2073.545326] RAX: 0000000000000007 RBX: 0000000000000004 RCX: 000000000000000a
[ 2073.546406] RDX: 0000000000000000 RSI: ffffffff9c46eee0 RDI: 0000000000000292
[ 2073.547503] RBP: ffff8801ab7dfde0 R08: 0000000000000000 R09: 0000000000000000
[ 2073.548587] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88012aaf7890
[ 2073.549668] R13: ffff88012aaf78b4 R14: dffffc0000000000 R15: 0000000000000003
[ 2073.550765] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 2073.551501] ? cpu_stop_queue_work (kernel/stop_machine.c:176)
[ 2073.552195] cpu_stopper_thread (kernel/stop_machine.c:480)
[ 2073.552845] ? cpu_stop_create (kernel/stop_machine.c:458)
[ 2073.553462] smpboot_thread_fn (kernel/smpboot.c:164)
[ 2073.554104] ? sort_range (kernel/smpboot.c:107)
[ 2073.554671] ? schedule (./arch/x86/include/asm/bitops.h:324 (discriminator 1) ./include/linux/thread_info.h:79 (discriminator 1) ./include/linux/sched.h:1605 (discriminator 1) kernel/sched/core.c:3435 (discriminator 1))
[ 2073.555230] ? __kthread_parkme (kernel/kthread.c:188)
[ 2073.555892] kthread (kernel/kthread.c:242)
[ 2073.556408] ? sort_range (kernel/smpboot.c:107)
[ 2073.556963] ? kthread_create_on_node (kernel/kthread.c:198)
[ 2073.557693] ret_from_fork (arch/x86/entry/entry_64.S:437)
[ 2073.558261] Code: 09 84 d2 74 05 e8 84 3a 4b 00 f7 83 84 00 00 00 fd ff ff ff 0f 84 a0 f9 ff ff 0f ff e9 99 f9 ff ff e8 78 18 ff ff e9 c7 fd ff ff <0f> ff e9 ea f9 ff ff e8 67 18 ff ff e9 f5 f9 ff ff e8 fd 85 0f

All code
========
0: 09 84 d2 74 05 e8 84 or %eax,-0x7b17fa8c(%rdx,%rdx,8)
7: 3a 4b 00 cmp 0x0(%rbx),%cl
a: f7 83 84 00 00 00 fd testl $0xfffffffd,0x84(%rbx)
11: ff ff ff
14: 0f 84 a0 f9 ff ff je 0xfffffffffffff9ba
1a: 0f ff (bad)
1c: e9 99 f9 ff ff jmpq 0xfffffffffffff9ba
21: e8 78 18 ff ff callq 0xffffffffffff189e
26: e9 c7 fd ff ff jmpq 0xfffffffffffffdf2
2b:* 0f ff (bad) <-- trapping instruction
2d: e9 ea f9 ff ff jmpq 0xfffffffffffffa1c
32: e8 67 18 ff ff callq 0xffffffffffff189e
37: e9 f5 f9 ff ff jmpq 0xfffffffffffffa31
3c: e8 fd 85 0f 00 callq 0xf863e

Code starting with the faulting instruction
===========================================
0: 0f ff (bad)
2: e9 ea f9 ff ff jmpq 0xfffffffffffff9f1
7: e8 67 18 ff ff callq 0xffffffffffff1873
c: e9 f5 f9 ff ff jmpq 0xfffffffffffffa06
11: e8 fd 85 0f 00 callq 0xf8613
[ 2073.561289] ---[ end trace 88455b894f38cbc5 ]---

And quite a few lines of your added trace (lmk if you need more, or all):

<idle>-0 [000] ..s2 104.829161: load_balance: dst_cpu: 6 cpus: 0-1,3-4,6-7,9
<idle>-0 [000] ..s1 104.829172: rebalance_domains: rcu-read-unlock: 6
<idle>-0 [000] ..s1 104.829177: rebalance_domains: rcu-read-lock: 0
<idle>-0 [000] ..s1 104.829179: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
<idle>-0 [000] ..s1 104.829183: rebalance_domains: rcu-read-unlock: 0
<idle>-0 [001] .Ns1 104.829313: rebalance_domains: rcu-read-lock: 1
<idle>-0 [001] .Ns1 104.829317: rebalance_domains: rcu-read-unlock: 1
<idle>-0 [006] .Ns1 104.829446: rebalance_domains: rcu-read-lock: 6
<idle>-0 [006] .Ns1 104.829450: rebalance_domains: rcu-read-unlock: 6
trinity-c28-3303 [006] d..1 104.829663: pick_next_task_fair: rcu-read-lock: 6
trinity-c28-3303 [006] d..1 104.829667: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.829671: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.829702: pick_next_task_fair: rcu-read-unlock: 6
kdevtmpfs-72 [001] d..1 104.830079: pick_next_task_fair: rcu-read-lock: 1
kdevtmpfs-72 [001] d..1 104.830083: load_balance: dst_cpu: 1 cpus: 1,9
kdevtmpfs-72 [001] d..1 104.830087: load_balance: dst_cpu: 1 cpus: 1,9
kdevtmpfs-72 [001] d..1 104.830097: pick_next_task_fair: rcu-read-unlock: 1
trinity-c37-3313 [000] d..1 104.830218: pick_next_task_fair: rcu-read-lock: 0
trinity-c37-3313 [000] d..1 104.830221: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c37-3313 [000] d..1 104.830231: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c37-3313 [000] d..1 104.830237: pick_next_task_fair: rcu-read-unlock: 0
trinity-c28-3303 [006] d..1 104.830912: pick_next_task_fair: rcu-read-lock: 6
trinity-c28-3303 [006] d..1 104.830916: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.830919: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.830950: pick_next_task_fair: rcu-read-unlock: 6
kworker/u26:2-413 [005] ..s1 104.831066: rebalance_domains: rcu-read-lock: 5
kworker/u26:2-413 [005] ..s1 104.831071: rebalance_domains: rcu-read-unlock: 5
rcu_preempt-9 [006] d..1 104.831127: pick_next_task_fair: rcu-read-lock: 6
rcu_preempt-9 [006] d..1 104.831130: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
rcu_preempt-9 [006] d..1 104.831133: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
rcu_preempt-9 [006] d..1 104.831138: pick_next_task_fair: rcu-read-unlock: 6
ksoftirqd/6-51 [006] d..1 104.831284: pick_next_task_fair: rcu-read-lock: 6
ksoftirqd/6-51 [006] d..1 104.831287: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
ksoftirqd/6-51 [006] d..1 104.831290: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
ksoftirqd/6-51 [006] d..1 104.831307: pick_next_task_fair: rcu-read-unlock: 6
kworker/u26:2-5519 [005] d..1 104.831458: pick_next_task_fair: rcu-read-lock: 5
kworker/u26:2-5519 [005] d..1 104.831462: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.831473: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.831479: pick_next_task_fair: rcu-read-unlock: 5
kworker/u26:2-5519 [005] d..1 104.831796: pick_next_task_fair: rcu-read-lock: 5
kworker/u26:2-5519 [005] d..1 104.831799: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] dN.1 104.831823: pick_next_task_fair: rcu-read-unlock: 5
ksoftirqd/1-18 [001] d..1 104.831914: pick_next_task_fair: rcu-read-lock: 1
ksoftirqd/1-18 [001] d..1 104.831918: load_balance: dst_cpu: 1 cpus: 1,9
ksoftirqd/1-18 [001] d..1 104.831928: load_balance: dst_cpu: 1 cpus: 1,9
ksoftirqd/1-18 [001] d..1 104.831934: pick_next_task_fair: rcu-read-unlock: 1
kworker/u26:2-5519 [005] ..s2 104.832087: rebalance_domains: rcu-read-lock: 5
kworker/u26:2-5519 [005] ..s2 104.832091: rebalance_domains: rcu-read-unlock: 5
trinity-c37-3313 [000] ..s2 104.832097: rebalance_domains: rcu-read-lock: 0
trinity-c37-3313 [000] ..s2 104.832101: rebalance_domains: rcu-read-unlock: 0
trinity-c37-3313 [000] d..1 104.832188: pick_next_task_fair: rcu-read-lock: 0
trinity-c37-3313 [000] d..1 104.832190: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c37-3313 [000] d..1 104.832195: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c37-3313 [000] d..1 104.832206: pick_next_task_fair: rcu-read-unlock: 0
kworker/u26:2-5519 [005] d..1 104.832226: pick_next_task_fair: rcu-read-lock: 5
kworker/u26:2-5519 [005] d..1 104.832230: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.832239: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.832244: pick_next_task_fair: rcu-read-unlock: 5
trinity-c12-5517 [008] d..1 104.832468: pick_next_task_fair: rcu-read-lock: 8
trinity-c12-5517 [008] d..1 104.832472: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
trinity-c12-5517 [008] d..1 104.832479: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
kworker/u26:2-5519 [005] d..1 104.832499: pick_next_task_fair: rcu-read-lock: 5
kworker/u26:2-5519 [005] d..1 104.832502: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.832506: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.832520: pick_next_task_fair: rcu-read-unlock: 5
trinity-c28-3303 [006] d..1 104.832533: pick_next_task_fair: rcu-read-lock: 6
trinity-c28-3303 [006] d..1 104.832537: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c12-5517 [008] d..1 104.832537: pick_next_task_fair: rcu-read-unlock: 8
trinity-c28-3303 [006] d..1 104.832540: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.832550: pick_next_task_fair: rcu-read-unlock: 6
ksoftirqd/8-63 [008] d..1 104.832790: pick_next_task_fair: rcu-read-lock: 8
ksoftirqd/8-63 [008] d..1 104.832793: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
ksoftirqd/8-63 [008] d..1 104.832797: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
ksoftirqd/8-63 [008] d..1 104.832809: pick_next_task_fair: rcu-read-unlock: 8
<idle>-0 [005] .Ns1 104.833083: rebalance_domains: rcu-read-lock: 5
<idle>-0 [005] .Ns1 104.833087: rebalance_domains: rcu-read-unlock: 5
trinity-c78-4927 [000] d..1 104.833225: pick_next_task_fair: rcu-read-lock: 0
trinity-c78-4927 [000] d..1 104.833228: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c78-4927 [000] d..1 104.833239: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c78-4927 [000] d..1 104.833260: pick_next_task_fair: rcu-read-unlock: 0
kworker/u26:2-5519 [005] d..1 104.833545: pick_next_task_fair: rcu-read-lock: 5
kworker/u26:2-5519 [005] d..1 104.833548: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.833785: load_balance: dst_cpu: 5 cpus: 1,9
kworker/u26:2-5519 [005] d..1 104.833799: pick_next_task_fair: rcu-read-unlock: 5
trinity-c28-3303 [006] ..s1 104.834090: rebalance_domains: rcu-read-lock: 6
trinity-c28-3303 [006] ..s1 104.834094: rebalance_domains: rcu-read-unlock: 6
rcu_preempt-9 [006] d..1 104.834193: pick_next_task_fair: rcu-read-lock: 6
rcu_preempt-9 [006] d..1 104.834196: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
rcu_preempt-9 [006] d..1 104.834200: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
rcu_preempt-9 [006] d..1 104.834210: pick_next_task_fair: rcu-read-unlock: 6
ksoftirqd/6-51 [006] d..1 104.834309: pick_next_task_fair: rcu-read-lock: 6
ksoftirqd/6-51 [006] d..1 104.834312: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
ksoftirqd/6-51 [006] d..1 104.834315: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
ksoftirqd/6-51 [006] d..1 104.834323: pick_next_task_fair: rcu-read-unlock: 6
<idle>-0 [000] .Ns1 104.834449: rebalance_domains: rcu-read-lock: 0
<idle>-0 [000] .Ns1 104.834452: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
<idle>-0 [000] .Ns2 104.834457: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
<idle>-0 [000] .Ns1 104.834459: rebalance_domains: rcu-read-unlock: 0
trinity-c37-3313 [000] d..1 104.834761: pick_next_task_fair: rcu-read-lock: 0
trinity-c37-3313 [000] d..1 104.834764: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c37-3313 [000] d..1 104.834769: load_balance: dst_cpu: 0 cpus: 0-1,3-4,6-7,9
trinity-c37-3313 [000] d..1 104.834780: pick_next_task_fair: rcu-read-unlock: 0
trinity-c79-3355 [005] .Ns1 104.835721: rebalance_domains: rcu-read-lock: 5
trinity-c79-3355 [005] .Ns1 104.835725: rebalance_domains: rcu-read-unlock: 5
<idle>-0 [008] .Ns1 104.835746: rebalance_domains: rcu-read-lock: 8
<idle>-0 [008] .Ns1 104.835751: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
<idle>-0 [008] .Ns2 104.835759: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
<idle>-0 [008] .Ns1 104.835761: rebalance_domains: rcu-read-unlock: 8
trinity-c28-3303 [006] d..1 104.835817: pick_next_task_fair: rcu-read-lock: 6
trinity-c28-3303 [006] d..1 104.835820: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.835824: load_balance: dst_cpu: 6 cpus: 1,3-4,6-7,9
trinity-c28-3303 [006] d..1 104.835829: pick_next_task_fair: rcu-read-unlock: 6
rcu_sched-10 [008] d..1 104.835871: pick_next_task_fair: rcu-read-lock: 8
rcu_sched-10 [008] d..1 104.835874: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
rcu_sched-10 [008] d..1 104.835879: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
rcu_sched-10 [008] d..1 104.835915: pick_next_task_fair: rcu-read-unlock: 8
trinity-c63-3339 [008] d..1 104.836114: pick_next_task_fair: rcu-read-lock: 8
trinity-c63-3339 [008] d..1 104.836117: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
trinity-c63-3339 [008] d..1 104.836122: load_balance: dst_cpu: 8 cpus: 0-1,3-4,6-7,9
trinity-c63-3339 [008] d..1 104.836133: pick_next_task_fair: rcu-read-unlock: 8
ksoftirqd/1-18 [001] ..s. 104.836352: rebalance_domains: rcu-read-lock: 1
ksoftirqd/1-18 [001] ..s. 104.836356: rebalance_domains: rcu-read-unlock: 1
<idle>-0 [000] .Ns1 104.836602: rebalance_domains: rcu-read-lock: 0
<idle>-0 [000] .Ns1 104.836606: rebalance_domains: rcu-read-unlock: 0
migration/0-12 [000] ..s1 104.837059: rebalance_domains: rcu-read-lock: 0
migration/0-12 [000] ..s1 104.837063: rebalance_domains: rcu-read-unlock: 0
migration/6-50 [006] ..s2 104.837087: rebalance_domains: rcu-read-lock: 6
migration/6-50 [006] ..s2 104.837092: rebalance_domains: rcu-read-unlock: 6
<idle>-0 [001] .Ns1 104.843088: rebalance_domains: rcu-read-lock: 1
<idle>-0 [001] .Ns1 104.843093: rebalance_domains: rcu-read-unlock: 1
<idle>-0 [008] .Ns1 104.843269: rebalance_domains: rcu-read-lock: 8
<idle>-0 [008] .Ns1 104.843273: rebalance_domains: rcu-read-unlock: 8
migration/2-24 [002] ..s1 104.855994: rebalance_domains: rcu-read-lock: 2
migration/2-24 [002] ..s1 104.856000: load_balance: dst_cpu: 2 cpus: 6

--

Thanks,
Sasha