sched/fair: Hard lockup from idle_balance()/task_numa_migrate() race
From: Ed Swierk
Date: Mon Oct 16 2017 - 19:11:55 EST
Ping for Peter, Ingo and other sched maintainers:
I'd appreciate any feedback on this hard lockup issue, which occurred
on a system running kernel 4.4.52-grsec.
To recap: a dual-socket Xeon (E5 v4) server system had been running a
bunch of KVM workloads just fine for over 6 weeks. Suddenly hard
lockups occurred on cpu 13 in task_numa_migrate(), and cpu 0 in
idle_balance(). That condition persisted until the system was rebooted
after a couple of hours, and meanwhile all manner of soft lockups and
even SATA errors appeared.
[3851435.777762] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
[3851435.784582] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
[3851435.874884] irq event stamp: 3828060294
[3851435.879398] hardirqs last enabled at (3828060293): [<ffffffff810db86c>] ffffffff810db86c
[3851435.888810] hardirqs last disabled at (3828060294): [<ffffffff8170ff74>] ffffffff8170ff74
[3851435.898254] softirqs last enabled at (3828060156): [<ffffffff8107f31e>] ffffffff8107f31e
[3851435.907666] softirqs last disabled at (3828060139): [<ffffffff8107f648>] ffffffff8107f648
[3851435.917094] CPU: 13 PID: 2146 Comm: pocviexg Not tainted 4.4.52-grsec #1
[3851435.925252] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
[3851435.937473] task: ffff881c9ff35c00 ti: ffff881c9ff372b0 task.ti: ffff881c9ff372b0
[3851435.946104] RIP: 0010:[<ffffffff810ad854>] [<ffffffff810ad854>] try_to_wake_up+0xe4/0x500
[3851435.955614] RSP: 0000:ffffc90016beba40 EFLAGS: 00000002
[3851435.961780] RAX: 0000000000000000 RBX: ffff88102cc8e3d8 RCX: 0000000000000001
[3851435.970016] RDX: 0000000000000001 RSI: 00000000000001e4 RDI: 0000000000000000
[3851435.978259] RBP: ffffc90016beba80 R08: 0000000000000000 R09: 0000000000000001
[3851435.986493] R10: 00000000000000e0 R11: ffff881c9ff35c00 R12: 0000000000000082
[3851435.994727] R13: ffff88102cc8dc00 R14: 00000000001d3f00 R15: ffffffff81f25060
[3851436.002962] FS: 00006e79227fc700(0000) GS:ffff882034000000(0000) knlGS:0000000000000000
[3851436.012266] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3851436.018920] CR2: 00006e79227fba38 CR3: 000000201d2fe000 CR4: 00000000003626f0
[3851436.027155] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[3851436.035391] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[3851436.043626] Stack:
[3851436.046098] ffff8820341ce620 ffff8820341ce608 0000000000000000 ffff8820341ce600
[3851436.054695] ffffc90016bebb48 000000000000000d 0000000000000000 ffff881033fce600
[3851436.063259] ffffc90016beba90 ffffffff810adc80 ffffc90016bebbe8 ffffffff8112c830
[3851436.071868] Call Trace:
[3851436.074862] [<ffffffff810adc80>] wake_up_process+0x10/0x20 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1997
[3851436.081317] [<ffffffff8112c830>] stop_two_cpus+0x1b0/0x280 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L235
[3851436.087778] [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
[3851436.094237] [<ffffffff8112c490>] ? cpu_stop_park+0x40/0x40
[3851436.100715] [<ffffffff810ad560>] ? __migrate_swap_task.part.94+0x70/0x70
[3851436.108539] [<ffffffff810ade01>] migrate_swap+0xf1/0x1f0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L1408
[3851436.114800] [<ffffffff810b9d55>] task_numa_migrate+0x345/0xb60 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1572
[3851436.121656] [<ffffffff811092e3>] ? futex_wait_queue_me+0xf3/0x160
[3851436.128801] [<ffffffff810ba5e3>] numa_migrate_preferred+0x73/0x80 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L1597
[3851436.135936] [<ffffffff810bb944>] task_numa_fault+0x7a4/0xee0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L2137
[3851436.142604] [<ffffffff810bafe0>] ? should_numa_migrate_memory+0x50/0x130
[3851436.150437] [<ffffffff811b2db6>] handle_mm_fault+0xfd6/0x1e10 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/mm/memory.c#L3503
[3851436.157189] [<ffffffff811b1e28>] ? handle_mm_fault+0x48/0x1e10
[3851436.164051] [<ffffffff810a85aa>] ? ___might_sleep+0x1ea/0x220
[3851436.170804] [<ffffffff8106405f>] __do_page_fault+0x1bf/0x460
[3851436.177475] [<ffffffff8110ca44>] ? sys_futex+0x74/0x230
[3851436.183642] [<ffffffff8106432c>] do_page_fault+0x2c/0x40
[3851436.189909] [<ffffffff817128b8>] page_fault+0x28/0x30
[3851436.195892] Code: 0f 85 e0 02 00 00 41 8b 85 b8 16 00 00 89 45 d0 41 8b 7d 4c 49 c7 c6 00 3f 1d 00 85 ff 0f 85 68 01 00 00 41 8b 4d 30 85 c9 74 0a <f3> 90 41 8b 55 30 85 d2 75 f6 49 8b 45 00 a8 02 b8 00 00 00 00
[3851438.906435] NMI watchdog: Watchdog detected hard LOCKUP on cpu 0
[3851438.913158] Modules linked in: vhost_net tun vhost macvtap 8021q garp mrp drbg ansi_cprng dm_crypt algif_skcipher af_alg ip_gre ip_tunnel gre macvlan ebtable_filter ebtables openvswitch nf_defrag_ipv6 libcrc32c nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG xt_limit xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul raid10 crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac lpc_ich mfd_core mei_me mei i2c_i801 ioatdma ipmi_ssif ipmi_msghandler acpi_pad acpi_cpufreq squashfs lz4_decompress xhci_pci xhci_hcd ixgbe mdio vxlan ip6_udp_tunnel udp_tunnel ptp pps_core dca ehci_pci ehci_hcd usbcore ahci libahci usb_common tpm_tis
[3851439.003452] irq event stamp: 25935634
[3851439.007769] hardirqs last enabled at (25935633): [<ffffffff81710341>] ffffffff81710341
[3851439.017010] hardirqs last disabled at (25935634): [<ffffffff8170ff74>] ffffffff8170ff74
[3851439.026226] softirqs last enabled at (25935520): [<ffffffff8107f31e>] ffffffff8107f31e
[3851439.035441] softirqs last disabled at (25935445): [<ffffffff8107f648>] ffffffff8107f648
[3851439.044657] CPU: 0 PID: 10 Comm: migration/0 Not tainted 4.4.52-grsec #1
[3851439.052892] Hardware name: Intel S2600WTTR, BIOS SE5C610.86B.01.01.0016.C1.033120161139 03/31/2016
[3851439.065111] task: ffff88102cc8dc00 ti: ffff88102cc8f2b0 task.ti: ffff88102cc8f2b0
[3851439.073747] RIP: 0010:[<ffffffff813c4935>] [<ffffffff813c4935>] delay_tsc+0x25/0x70
[3851439.082679] RSP: 0000:ffffc900006ebb40 EFLAGS: 00000002
[3851439.088852] RAX: 000000002591f96b RBX: ffff8820341ce608 RCX: 0000000000000018
[3851439.101462] RDX: 001cab2e00000000 RSI: 001cab2e2591f96b RDI: 0000000000000001
[3851439.109695] RBP: ffffc900006ebb40 R08: 0000000000000000 R09: 0000000000000000
[3851439.117950] R10: 0000000000000038 R11: 0000000000000000 R12: 0000000021d37081
[3851439.126198] R13: 0000000000000001 R14: 000000007d234b28 R15: 0000000000000000
[3851439.134440] FS: 0000000000000000(0000) GS:ffff881033e00000(0000) knlGS:0000000000000000
[3851439.143741] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[3851439.150393] CR2: 000000c420148008 CR3: 0000000001f36000 CR4: 00000000003626f0
[3851439.158627] Stack:
[3851439.161096] ffffc900006ebb50 ffffffff813c487a ffffc900006ebb80 ffffffff810dba87
[3851439.169669] ffff8820341ce608 ffff8820341ce620 0000000000000086 00000000000006ae
[3851439.178228] ffffc900006ebbb0 ffffffff817101a2 ffffffff8112c1cb ffff8820341ce600
[3851439.186825] Call Trace:
[3851439.189784] [<ffffffff813c487a>] __delay+0xa/0x10
[3851439.195385] [<ffffffff810dba87>] do_raw_spin_lock+0x87/0x160
[3851439.202062] [<ffffffff817101a2>] _raw_spin_lock_irqsave+0xa2/0xb0
[3851439.209238] [<ffffffff8112c1cb>] ? cpu_stop_queue_work+0x2b/0x70
[3851439.216287] [<ffffffff8112c1cb>] cpu_stop_queue_work+0x2b/0x70 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L89
[3851439.223149] [<ffffffff8112c92b>] stop_one_cpu_nowait+0x2b/0x30 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/stop_machine.c#L310
[3851439.229999] [<ffffffff810c1127>] load_balance+0x827/0x960 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L7184
[3851439.236371] [<ffffffff810eae3d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
[3851439.243995] [<ffffffff810c1608>] pick_next_task_fair+0x3a8/0x750 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/fair.c#L5348
[3851439.251062] [<ffffffff810c14db>] ? pick_next_task_fair+0x27b/0x750
[3851439.258289] [<ffffffff81708fdd>] __schedule+0x1ad/0xbd0 https://github.com/skyportsystems/linux/blob/4.4.52-grsec/kernel/sched/core.c#L3196
[3851439.264456] [<ffffffff810d461d>] ? trace_hardirqs_on_caller+0x13d/0x1d0
[3851439.272204] [<ffffffff81709a37>] schedule+0x37/0x80
[3851439.277984] [<ffffffff810a51cb>] smpboot_thread_fn+0x16b/0x1e0
[3851439.284822] [<ffffffff810a5060>] ? sort_range+0x20/0x20
[3851439.291001] [<ffffffff810a124c>] kthread+0xfc/0x120
[3851439.296796] [<ffffffff8170adf1>] ? wait_for_completion+0xe1/0x110
[3851439.303936] [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
[3851439.311464] [<ffffffff81710aae>] ret_from_fork+0x3e/0x70
[3851439.317738] [<ffffffff810a1150>] ? kthread_create_on_node+0x240/0x240
[3851439.325281] Code: 00 5d c3 0f 1f 00 55 65 ff 05 b8 65 c4 7e 48 89 e5 65 44 8b 05 35 48 c4 7e 0f ae e8 0f 31 48 c1 e2 20 48 89 d6 48 09 c6 0f ae e8 <0f> 31 48 c1 e2 20 48 09 d0 48 89 c1 48 29 f1 48 39 cf 76 27 65
[3851450] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
[3851450] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
[3851450] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
[3851478] NMI watchdog: BUG: soft lockup - CPU#8 stuck for 22s! [pocviexg:2136]
[3851478] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
[3851478] NMI watchdog: BUG: soft lockup - CPU#14 stuck for 22s! [CPU 1/KVM:10302]
[3851482] BUG: spinlock lockup suspected on CPU#0, migration/0/10
[3851482] Sending NMI to all CPUs:
[3851482] NMI backtrace for cpu 0
[3851483] NMI backtrace for cpu 1
[3851483] NMI backtrace for cpu 2
[3851483] NMI backtrace for cpu 3
[3851483] NMI backtrace for cpu 4
[3851484] ata8.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
[3851484] ata7.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen **
[3851484] NMI backtrace for cpu 5
[3851484] NMI backtrace for cpu 6
...
[3851489] NMI backtrace for cpu 24
[3851489] ata8.00: qc timeout (cmd 0xec) **
[3851489] ata7.00: qc timeout (cmd 0xec) **
[3851489] NMI backtrace for cpu 25
...
[3851490] NMI backtrace for cpu 31
[3851499] ata8.00: qc timeout (cmd 0xec) **
[3851499] ata7.00: qc timeout (cmd 0xec) **
[3851510] NMI watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [CPU 0/KVM:9662]
...and so on for 2 more hours, until the system was finally rebooted
Based on the addresses in the stack and registers, here's what I think
happened.
On cpu 13:
- task_numa_fault() calls task_numa_migrate(), which selects the task
on cpu 0 as the dst_task.
- migrate_swap() calls stop_two_cpus(), which acquires the cpu_stopper
locks for the dst_cpu (cpu 0, at 0xffff881033fce600) and src_cpu
(cpu X, at 0xffff8820341ce600).
- stop_two_cpus() calls wake_up_process() on the lower-numbered cpu
first, which has to be cpu 0.
- wake_up_process() spins until the cpu 0 task (at 0xffff88102cc8dc00)
is no longer on_cpu.
On cpu 0:
- pick_next_task_fair() calls idle_balance(). According to the "This
is OK" comment, current is on_cpu at this point.
- idle_balance() calls load_balance() for dst_cpu 0.
- load_balance() decides to move a task from cpu X, so calls
stop_one_cpu_nowait() on cpu X.
- stop_one_cpu_nowait() spins trying to acquire the cpu_stopper lock
for cpu X (at 0xffff8820341ce600).
So idle_balance() on cpu 0 is stuck waiting for task_numa_fault() to
move a task to cpu 0, which is blocked on idle_balance() completing.
Meanwhile, the SATA controller interrupt handler, which is placed on
cpu 0, doesn't get a chance to run, causing the command failure
messages from ata7 and ata8.
Also, it appears that task_numa_fault() tries to migrate current, so
the src_cpu X used by task_numa_migrate() is cpu 13 in this
case. Though the key issue is that both task_numa_migrate() and
idle_balance() are trying to stop the same cpu, regardless of whether
it's the cpu task_numa_migrate() is running on.
Is this a plausible theory?
This is the only instance of a hard lockup among a number of identical
systems running the same kernel, so I'm not seeing an easy way to
reproduce this issue, but any suggestions would be welcome.
--Ed