Re: rcu stalls and soft lockups with recent kernels

From: Ion Badulescu
Date: Wed Mar 16 2016 - 12:22:06 EST


Just following up to my own email:

It turns out that we can eliminate the RCU stalls by changing from CONFIG_RCU_NOCB_CPU_ALL to CONFIG_RCU_NOCB_CPU_NONE. Letting each cpu handle its own RCU callbacks completely fixes the problems for us.
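For anyone wanting to reproduce the comparison, the change amounts to flipping the build-forced no-CBs choice in the kernel .config (option names as in the 4.4-era Kconfig; note that with _NONE you can still offload selected cpus at boot via the rcu_nocbs= parameter):

```
# Before (the Fedora/RHEL7 default): all cpus' RCU callbacks are
# offloaded to rcuo* kthreads
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_ALL=y

# After: no cpus are build-forced into callback offloading; each cpu
# processes its own callbacks unless rcu_nocbs= is passed at boot
CONFIG_RCU_NOCB_CPU=y
CONFIG_RCU_NOCB_CPU_NONE=y
```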

Now, CONFIG_NO_HZ_FULL plus CONFIG_RCU_NOCB_CPU_ALL is the default config for Fedora and RHEL7. Ho-humm... anybody interested in tracking this down further?

The bug is likely to be a missing wakeup for the rcuo* threads. It would explain why the stalls get resolved eventually, maybe when another RCU callback gets scheduled (properly this time!) and the threads wake up and process their entire queue. But it's just speculation at this point.

And one final point: the RCU stall stack traces are actually rather useless when the RCU callbacks are delegated to kernel threads. What one particular cpu may be doing when it's fallen behind on RCU grace periods isn't very interesting, given that its rcuo* threads could be running anywhere, on any cpu.

Thanks,
-Ion


On 02/04/2016 02:12 PM, Ion Badulescu wrote:
Hello,

We run a compute cluster of about 800 or so machines here, which makes
heavy use of NFS and fscache (on a dedicated local drive with an ext4
filesystem) and also exercises the other local drives pretty hard. All
the compute jobs run as unprivileged users with SCHED_OTHER scheduling,
nice level 1.

Over the past month or two, we've been seeing some strange and seemingly
random rcu stalls and soft cpu lockups when running jobs on the cluster.
The stalls are bad enough to cause hard drives to get kicked out of MD
arrays, and even to get offlined altogether by the SCSI layer. The
kernel running on these machines is based on 3.18.12 with some local
patches, which we've been running quite happily since early May 2015.
It's unclear what started triggering the stalls after all these months,
as we've been unable to correlate them to any particular jobs. The one
thing that's clear is that they tend to happen in bunches on multiple
machines at the same time, so whatever the trigger is, it's some
centralized condition: it could be a certain type of job, or it could
be the state of the centralized NFS servers they access.

In an effort to solve the issue and isolate its cause, we upgraded the
kernel on about 80 of those machines to the latest 4.4.1 kernel, this
time keeping out most of the local patches we had. We also enabled some
more kernel debugging options, including lockdep. Only a few local
patches were kept in:
- printing some extra information in sysrq
- disabling audit-to-printk
- perf build fixes
- changing the slab calculate_order algorithm to favor small allocation
orders whenever possible.

The full kernel config is available at
http://www.badula.org/config-4.4.1 and the combined patch with our local
changes is at http://www.badula.org/linux-ion-git-4.4.1.diff

Unfortunately, the stalls did not go away. Many of the machines running
4.4.1 hit stalls with stack traces in the sunrpc transmit path, similar
to this one:


Feb 2 04:26:02 INFO: rcu_sched self-detected stall on CPU
INFO: rcu_sched detected stalls on CPUs/tasks:
Feb 2 04:26:02 6-...: (20999 ticks this GP) idle=cf1/140000000000001/0
softirq=9090641/9090641 fqs=7000
Feb 2 04:26:02 (detected by 5, t=21003 jiffies, g=1733453, c=1733452, q=0)
Feb 2 04:26:02 running task 0 328895 2 0x00080088 6
FAIR -20 6
Feb 2 04:26:02
Feb 2 04:26:02 ffffffffffffff10 ffffffff8152abaf 0000000000000000
0000000000000004
Feb 2 04:26:02 000000043121b8b0 ffff881001149808 0000000000000296
0208002000000000
Feb 2 04:26:02 0000000000000296 ffff88003121b908 ffffffff810968fb
ffff8810010ed9c8
Feb 2 04:26:02 Call Trace:
Feb 2 04:26:02 [<ffffffff8152abaf>] ? _raw_spin_lock_irqsave+0xf/0x40
Feb 2 04:26:02 [<ffffffff810968fb>] ? finish_wait+0x6b/0x90
Feb 2 04:26:02 [<ffffffff814694aa>] ? sk_stream_wait_memory+0x24a/0x2d0
Feb 2 04:26:02 [<ffffffff810966a0>] ? woken_wake_function+0x20/0x20
Feb 2 04:26:02 [<ffffffff8152a8da>] ? _raw_spin_unlock_bh+0x1a/0x20
Feb 2 04:26:02 [<ffffffff8145d22d>] ? release_sock+0xfd/0x150
Feb 2 04:26:02 [<ffffffff814bfc16>] ? tcp_sendpage+0xd6/0x5e0
Feb 2 04:26:02 [<ffffffff814e7bc0>] ? inet_sendpage+0x50/0xe0
Feb 2 04:26:02 [<ffffffffa03b4ea5>] ? xs_nospace+0x75/0xf0 [sunrpc]
Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb 2 04:26:02 [<ffffffff81043946>] ? read_hpet+0x16/0x20
Feb 2 04:26:02 [<ffffffff810b7ec2>] ? ktime_get+0x52/0xc0
Feb 2 04:26:02 [<ffffffffa03b2133>] ? xprt_transmit+0x63/0x3a0 [sunrpc]
Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb 2 04:26:02 [<ffffffff8152a8da>] ? _raw_spin_unlock_bh+0x1a/0x20
Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb 2 04:26:02 [<ffffffffa03ae538>] ? call_transmit+0x1d8/0x2c0 [sunrpc]
Feb 2 04:26:02 [<ffffffffa03ae360>] ? call_transmit_status+0x110/0x110
[sunrpc]
Feb 2 04:26:02 [<ffffffffa03bb4e9>] ? __rpc_execute+0x89/0x3c0 [sunrpc]
Feb 2 04:26:02 [<ffffffff81081f15>] ? finish_task_switch+0x125/0x260
Feb 2 04:26:02 [<ffffffffa03bb8e5>] ? rpc_async_schedule+0x15/0x20
[sunrpc]
Feb 2 04:26:02 [<ffffffff8106fad8>] ? process_one_work+0x148/0x450
Feb 2 04:26:02 [<ffffffff8106ff12>] ? worker_thread+0x132/0x600
Feb 2 04:26:02 [<ffffffff81083862>] ? default_wake_function+0x12/0x20
Feb 2 04:26:02 [<ffffffff81096566>] ? __wake_up_common+0x56/0x90
Feb 2 04:26:02 [<ffffffff8106fde0>] ? process_one_work+0x450/0x450
Feb 2 04:26:02 [<ffffffff8106fde0>] ? process_one_work+0x450/0x450
Feb 2 04:26:02 [<ffffffff8107511c>] ? kthread+0xcc/0xf0
Feb 2 04:26:02 [<ffffffff81075050>] ?
kthread_freezable_should_stop+0x70/0x70
Feb 2 04:26:02 [<ffffffff8152b0df>] ? ret_from_fork+0x3f/0x70
Feb 2 04:26:02 [<ffffffff81075050>] ?
kthread_freezable_should_stop+0x70/0x70


So we added a cond_resched() call inside the loop in __rpc_execute,
triggered once the loop has made more than 20 iterations. It's clearly
just a bandaid, and maybe also in the wrong place, but we wanted to see
if it would help.
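The change looked roughly like this (a sketch against net/sunrpc/sched.c rather than our exact patch; the counter name and the threshold of 20 are our local choices):

```diff
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ static void __rpc_execute(struct rpc_task *task)
 {
+	int loops = 0;
 	...
 	for (;;) {
 		...
+		/* Bandaid: if the task state machine keeps spinning
+		 * (e.g. xs_nospace() leaving the task runnable), give
+		 * the scheduler -- and RCU -- a chance to make progress. */
+		if (++loops > 20)
+			cond_resched();
 	}
 }
```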
And indeed it did, because the very common sunrpc-related stalls went
away. Instead, this ext4-related stall started showing up:


[91919.498218] INFO: rcu_sched self-detected stall on CPU
[91919.498221] INFO: rcu_sched detected stalls on CPUs/tasks:
[91919.498235] 15-...: (4 ticks this GP) idle=dbf/140000000000001/0
softirq=16345141/16345141 fqs=1
[91919.498238] (detected by 10, t=178100 jiffies, g=4061489, c=4061488,
q=0)
[91919.498239] Task dump for CPU 15:
[91919.498249] condor_starter R running task 0 506564 502795
0x00080080 15 FAIR 0 0-23 10504 2591
[91919.498252] 0000000000000004 ffff880049bb7958 0000000000000246
0000000000000002
[91919.498254] 0000000000000000 ffff880049bb7938 ffffffff812c83b6
0000000000000000
[91919.498255] ffffffff81156070 ffff880000000000 ffffffff00000000
0000000000000246
[91919.498255] Call Trace:
[91919.498261] [<ffffffff812c83b6>] ? __radix_tree_lookup+0x36/0x100
[91919.498265] [<ffffffff81156070>] ? find_get_pages+0x280/0x280
[91919.498267] [<ffffffff81156165>] ? find_get_entry+0xf5/0x200
[91919.498268] [<ffffffff8115617d>] ? find_get_entry+0x10d/0x200
[91919.498270] [<ffffffff81156070>] ? find_get_pages+0x280/0x280
[91919.498272] [<ffffffff81158288>] ? pagecache_get_page+0x38/0x1d0
[91919.498275] [<ffffffff8120bff2>] ? alloc_buffer_head+0x42/0x50
[91919.498277] [<ffffffff8120c86d>] ? __find_get_block_slow+0x4d/0x160
[91919.498280] [<ffffffff8159f35b>] ? _raw_spin_unlock+0x2b/0x40
[91919.498281] [<ffffffff8120d74c>] ? create_empty_buffers+0x9c/0xc0
[91919.498284] [<ffffffff810aabb0>] ? mark_held_locks+0x70/0x90
[91919.498285] [<ffffffff815a00d7>] ? retint_kernel+0x10/0x10
[91919.498286] [<ffffffff810aae7f>] ? trace_hardirqs_on_caller+0xff/0x1c0
[91919.498290] [<ffffffff81001017>] ? trace_hardirqs_on_thunk+0x17/0x19
[91919.498292] [<ffffffff810c40cc>] ? rcu_irq_exit+0x6c/0xb0
[91919.498293] [<ffffffff815a00d7>] ? retint_kernel+0x10/0x10
[91919.498296] [<ffffffff812ceb95>] ?
copy_user_enhanced_fast_string+0x5/0x10
[91919.498299] [<ffffffff812d54ec>] ?
iov_iter_copy_from_user_atomic+0x9c/0x240
[91919.498300] [<ffffffff8115535e>] ? generic_perform_write+0xfe/0x1d0
[91919.498302] [<ffffffff811ed3f2>] ? generic_update_time+0x82/0xd0
[91919.498305] [<ffffffff81157466>] ?
__generic_file_write_iter+0x196/0x1f0
[91919.498315] [<ffffffffa007dd31>] ? ext4_file_write_iter+0x121/0x3a0
[ext4]
[91919.498318] [<ffffffff811d321c>] ? __sb_start_write+0xbc/0xd0
[91919.498320] [<ffffffff811d0abf>] ? __vfs_write+0xcf/0xf0
[91919.498322] [<ffffffff811d2123>] ? vfs_write+0xc3/0x1a0
[91919.498323] [<ffffffff811d22e9>] ? SyS_write+0x59/0xc0
[91919.498325] [<ffffffff8159f5d7>] ? entry_SYSCALL_64_fastpath+0x12/0x6f
[91919.498327] rcu_sched kthread starved for 178097 jiffies! g4061489
c4061488 f0x0 s3 ->state=0x1
[91919.750048]
[91919.751744] 15-...: (4 ticks this GP) idle=dbf/140000000000001/0
softirq=16345141/16345141 fqs=2
[91919.761832] (t=178100 jiffies g=4061489 c=4061488 q=0)
[91919.767820] Task dump for CPU 15:
[91919.767821] condor_starter R running task 0 506564 502795
0x00080088 15 FAIR 0 0-23 10504 2591
[91919.767830] 0000000000000010 ffff88207fc63b58 ffffffff8108aec5
ffffffff8108acf8
[91919.776246] 0000000000000001 0000000033322d30 ffffffff817e3471
00ff882000000000
[91919.784645] 0000000000000000 0000000000000000 000000000000000f
ffffffff81a4bb80
[91919.793038] Call Trace:
[91919.795830] <IRQ> [<ffffffff8108aec5>] sched_show_task+0x235/0x450
[91919.803003] [<ffffffff8108acf8>] ? sched_show_task+0x68/0x450
[91919.809569] [<ffffffff810b73af>] ? vprintk_default+0x1f/0x30
[91919.816036] [<ffffffff81154de9>] ? printk+0x46/0x48
[91919.821625] [<ffffffff8108b11f>] dump_cpu_task+0x3f/0x50
[91919.827700] [<ffffffff810c3594>] rcu_dump_cpu_stacks+0x84/0xc0
[91919.834360] [<ffffffff810c3ae9>] check_cpu_stall+0x1d9/0x290
[91919.840823] [<ffffffff810ad66c>] ? __lock_acquire+0x48c/0x17e0
[91919.847504] [<ffffffff810c56f2>] rcu_check_callbacks+0x102/0x3a0
[91919.854371] [<ffffffff810da629>] ? tick_sched_timer+0x29/0x80
[91919.860936] [<ffffffff8105f0a8>] ? __raise_softirq_irqoff+0x28/0x80
[91919.868082] [<ffffffff810a91fd>] ? trace_hardirqs_off+0xd/0x10
[91919.879805] [<ffffffff810c9fa9>] update_process_times+0x39/0x70
[91919.886561] [<ffffffff810da519>] tick_sched_handle+0x39/0x70
[91919.893039] [<ffffffff810da644>] tick_sched_timer+0x44/0x80
[91919.899435] [<ffffffff810caa56>] __hrtimer_run_queues+0xe6/0x2c0
[91919.906286] [<ffffffff810cad9c>] ? hrtimer_interrupt+0x7c/0x1a0
[91919.913042] [<ffffffff810da600>] ? tick_nohz_handler+0xb0/0xb0
[91919.919701] [<ffffffff810a91fd>] ? trace_hardirqs_off+0xd/0x10
[91919.926365] [<ffffffff810d09a8>] ?
ktime_get_update_offsets_now+0xb8/0x170
[91919.934191] [<ffffffff810cad9c>] ? hrtimer_interrupt+0x7c/0x1a0
[91919.940948] [<ffffffff810cadbd>] hrtimer_interrupt+0x9d/0x1a0
[91919.947518] [<ffffffff81090274>] ?
vtime_common_account_irq_enter+0x34/0x50
[91919.955446] [<ffffffff8103b529>] local_apic_timer_interrupt+0x39/0x60
[91919.962786] [<ffffffff815a1fd5>] smp_apic_timer_interrupt+0x45/0x59
[91919.969930] [<ffffffff815a035e>] apic_timer_interrupt+0x8e/0xa0
[91919.976687] <EOI> [<ffffffff812ceb95>] ?
copy_user_enhanced_fast_string+0x5/0x10
[91919.985237] [<ffffffff812d54ec>] ?
iov_iter_copy_from_user_atomic+0x9c/0x240
[91919.993274] [<ffffffff8115535e>] generic_perform_write+0xfe/0x1d0
[91920.000229] [<ffffffff811ed3f2>] ? generic_update_time+0x82/0xd0
[91920.007083] [<ffffffff81157466>] __generic_file_write_iter+0x196/0x1f0
[91920.014529] [<ffffffffa007dd31>] ext4_file_write_iter+0x121/0x3a0
[ext4]
[91920.022161] [<ffffffff811d321c>] ? __sb_start_write+0xbc/0xd0
[91920.028742] [<ffffffff811d0abf>] __vfs_write+0xcf/0xf0
[91920.034629] [<ffffffff811d2123>] vfs_write+0xc3/0x1a0
[91920.040413] [<ffffffff811d22e9>] SyS_write+0x59/0xc0
[91920.046098] [<ffffffff8159f5d7>] entry_SYSCALL_64_fastpath+0x12/0x6f


We also saw some cases of soft lockups showing no kernel stack trace at
all, with RIP pointing into userspace memory:


[92163.555725] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 21s!
[strat_sim:570566]
[92163.564455] Modules linked in: nfsv3 nfs_acl nfs msr autofs4 lockd
grace sunrpc cachefiles fscache binfmt_misc nls_iso8859_1 nls_cp437 vfat
fat vhost_net vhost tun kvm irqbypass input_leds hid_generic iTCO_wdt
iTCO_vendor_support pcspkr sfc mtd i2c_algo_bit sb_edac sg lpc_ich
mfd_core ehci_pci ehci_hcd xhci_pci xhci_hcd i2c_i801 i2c_core ixgbe ptp
pps_core mdio ipmi_devintf ipmi_si ipmi_msghandler tpm_tis tpm
acpi_power_meter hwmon ext4 jbd2 mbcache crc16 raid1 dm_mirror
dm_region_hash dm_log dm_mod
[92163.614893] irq event stamp: 12063822
[92163.619020] hardirqs last enabled at (12063821):
[<ffffffff815a00bf>] retint_user+0x18/0x20
[92163.628536] hardirqs last disabled at (12063822):
[<ffffffff815a0339>] apic_timer_interrupt+0x69/0xa0
[92163.638920] softirqs last enabled at (12063814):
[<ffffffff8105fa08>] __do_softirq+0x1f8/0x390
[92163.648723] softirqs last disabled at (12063801):
[<ffffffff8105fceb>] irq_exit+0xfb/0x110
[92163.658042] CPU: 7 PID: 570566 Comm: strat_sim Tainted: G
I 4.4.1-el6.ia32e.lime.0 #1
[92163.668134] Hardware name: Intel Corporation S2600WTT/S2600WTT, BIOS
SE5C610.86B.01.01.0011.081020151200 08/10/2015
[92163.679871] task: ffff880838a30000 ti: ffff8808c270c000 task.ti:
ffff8808c270c000
[92163.688302] RIP: 0033:[<0000000003cd9e89>] [<0000000003cd9e89>]
0x3cd9e89
[92163.696036] RSP: 002b:00007ffdd497e168 EFLAGS: 00000202
[92163.702009] RAX: 000000002860dad8 RBX: 00000000285fb790 RCX:
00000000285fb730
[92163.710024] RDX: 0000000015546340 RSI: 0000000000000001 RDI:
000000002861afb0
[92163.718040] RBP: 00000000285cd220 R08: 0000000015476050 R09:
000000000000000a
[92163.726055] R10: 0000000000000001 R11: 0000000000000009 R12:
ffffffffffffff80
[92163.734071] R13: ffffffffffffff80 R14: 000000000000007f R15:
000000002861afe8
[92163.742088] FS: 00002b21c95e3de0(0000) GS:ffff88103fce0000(0000)
knlGS:0000000000000000
[92163.751199] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[92163.757661] CR2: 00002b21e3bcbd40 CR3: 00000004b3087000 CR4:
00000000001406e0

[these are traces from different machines, btw, so don't try to read
anything into their timestamps]

At this point we're kinda stumped. Everything seems to point to a
problem at a pretty low level, not related to sunrpc or ext4 or any
other high-level subsystem... but we can't quite figure out what it is.

Any insights or potential fixes would be highly appreciated.

Thanks,
-Ion