Re: rcu_sched stall while waiting in csd_lock_wait()

From: Pratyush Anand
Date: Wed Aug 02 2017 - 23:55:51 EST


Hi Will,

On Wednesday 02 August 2017 01:38 PM, Will Deacon wrote:
Hi Pratyush,

On Wed, Aug 02, 2017 at 09:01:19AM +0530, Pratyush Anand wrote:
I am observing following rcu_sched stall while executing `perf record -a --
sleep 1` with one of the arm64 platform. It looks like that stalled cpu was
waiting in csd_lock_wait() from where it never came out,and so the stall.
Any help/pointer for further debugging would be very helpful. Problem also
reproduced with 4.13.0-rc3.
When you say "also", which other kernel(s) show the problem? Is this a
recent regression? Which platform are you running on?

Other than 4.13.0-rc3 I had tested with 4.11 based kernel which has 4.11 vanila+some 4.12 patches back ported, and log which I had attached was from same kernel, thats why I mentioned that it reproduced with vanilla upstream as well.


It would be interesting to know what the other CPUs are doing, in particular
the target of the cross-call. Either it crashed spectacularly and didn't
unlock the csd lock, or the IPI somehow wasn't delivered.

Do you see any other splats if you enable lock debugging?

It was same.

Following is the log from 4.13.0-rc3 + patch pointed by Peter:

[ 173.649589] perf: interrupt took too long (4952 > 4902), lowering kernel.perf_event_max_sample_rate to 40300
[ 201.340926] INFO: rcu_sched self-detected stall on CPU
[ 201.345115] 9-...: (6499 ticks this GP) idle=e1a/140000000000001/0 softirq=334/334 fqs=3250
[ 201.353617] (t=6500 jiffies g=313 c=312 q=428)
[ 201.358220] Task dump for CPU 9:
[ 201.361431] perf R running task 0 1888 1864 0x00000202
[ 201.368462] Call trace:
[ 201.370897] [<ffff000008089498>] dump_backtrace+0x0/0x28c
[ 201.376276] [<ffff000008089748>] show_stack+0x24/0x2c
[ 201.381312] [<ffff00000810c5d0>] sched_show_task+0x19c/0x26c
[ 201.386952] [<ffff00000810df28>] dump_cpu_task+0x48/0x54
[ 201.392250] [<ffff00000816c564>] rcu_dump_cpu_stacks+0xac/0xf4
[ 201.398063] [<ffff00000816b714>] rcu_check_callbacks+0x908/0xc90
[ 201.404053] [<ffff000008172fe0>] update_process_times+0x34/0x5c
[ 201.409957] [<ffff00000818697c>] tick_sched_handle.isra.16+0x4c/0x70
[ 201.416292] [<ffff0000081869e8>] tick_sched_timer+0x48/0x88
[ 201.421847] [<ffff000008173c60>] __hrtimer_run_queues+0x17c/0x604
[ 201.427924] [<ffff000008174b90>] hrtimer_interrupt+0xa4/0x1e8
[ 201.433656] [<ffff0000086d9d34>] arch_timer_handler_phys+0x3c/0x48
[ 201.439818] [<ffff00000815b718>] handle_percpu_devid_irq+0xdc/0x42c
[ 201.446069] [<ffff00000815569c>] generic_handle_irq+0x34/0x4c
[ 201.451796] [<ffff000008155dec>] __handle_domain_irq+0x6c/0xc4
[ 201.457611] [<ffff0000080816c4>] gic_handle_irq+0xa0/0x1b0
[ 201.463080] Exception stack(0xffff8016df013a40 to 0xffff8016df013b70)
[ 201.469504] 3a40: 0000000000000000 0000000000000003 0000000000000000 ffff8016df013bd0
[ 201.477316] 3a60: ffff8016df013bd0 0000000000080000 ffff8016df013bb8 ffff0000082113c8
[ 201.485129] 3a80: 0000000000000000 0000ffffe507a9f0 00000000ffffffff 0000000000000000
[ 201.492941] 3aa0: 0000000000000005 ffffffffffffffff 002f547d23157399 00003a2a9f82ac9c
[ 201.500754] 3ac0: 0000000000000000 0000000000000000 0000ffffe507a7e0 ffff000008f5b000
[ 201.508566] 3ae0: ffff8016df013c08 ffff000008213fcc 0000000000000013 ffff8017616a7800
[ 201.516379] 3b00: ffff000008f5b000 ffff0000082179d4 0000000000000000 ffff0000088c1000
[ 201.524191] 3b20: ffff8017616a7800 ffff8016df013b70 ffff00000818bf04 ffff8016df013b70
[ 201.532004] 3b40: ffff00000818bf28 0000000020400145 ffff000008213fcc 0000000000000013
[ 201.539816] 3b60: 0001000000000000 ffff8016df013bb8
[ 201.544677] [<ffff000008082eb8>] el1_irq+0xb8/0x140
[ 201.549539] [<ffff00000818bf28>] smp_call_function_single+0x160/0x184
[ 201.555965] [<ffff000008211128>] cpu_function_call+0x40/0x64
[ 201.561605] [<ffff0000082113a0>] event_function_call+0x120/0x128
[ 201.567594] [<ffff0000082113ec>] _perf_event_disable+0x44/0x64
[ 201.573410] [<ffff000008210afc>] perf_event_for_each_child+0x3c/0x84
[ 201.579747] [<ffff00000821bcfc>] perf_ioctl+0x21c/0x9a4
[ 201.584957] [<ffff0000082eec6c>] do_vfs_ioctl+0xcc/0x874
[ 201.590250] [<ffff0000082ef4a4>] sys_ioctl+0x90/0xa4
[ 201.595198] [<ffff0000080834a0>] __sys_trace_return+0x0/0x4
[ 239.003035] INFO: rcu_sched detected expedited stalls on CPUs/tasks: { 9-... } 6592 jiffies s: 1149 root: 0x1/.
[ 239.012199] blocking rcu_node structures: l=1:0-14:0x200/.
[ 239.017695] Task dump for CPU 9:
[ 239.020880] perf R running task 0 1888 1864 0x00000202
[ 239.027929] Call trace:
[ 239.030346] [<ffff000008085c9c>] __switch_to+0x64/0x70
[ 239.035484] [<ffff000008235a00>] free_pcppages_bulk+0x43c/0x640
[ 262.304244] perf: interrupt took too long (6221 > 6190), lowering kernel.perf_event_max_sample_rate to 32100
[ 367.009704] INFO: task kworker/15:2:1187 blocked for more than 120 seconds.
[ 367.015713] Tainted: G W 4.13.0-rc3+ #2
[ 367.021200] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 367.028994] kworker/15:2 D 0 1187 2 0x00000000
[ 367.034481] Workqueue: usb_hub_wq hub_event
[ 367.038629] Call trace:
[ 367.041077] [<ffff000008085c9c>] __switch_to+0x64/0x70
[ 367.046184] [<ffff0000088a3680>] __schedule+0x410/0xcbc
[ 367.051403] [<ffff0000088a3f6c>] schedule+0x40/0xa4
[ 367.056253] [<ffff00000816a578>] _synchronize_rcu_expedited.constprop.64+0x370/0x3e8
[ 367.063990] [<ffff00000816a66c>] synchronize_sched_expedited+0x7c/0xf0
[ 367.070501] [<ffff00000816aa0c>] synchronize_rcu_expedited+0x14/0x1c
[ 367.076825] [<ffff00000825ec24>] bdi_unregister+0x78/0x200
[ 367.082307] [<ffff00000841fbac>] del_gendisk+0x130/0x240
[ 367.087592] [<ffff000003de0688>] sr_remove+0x30/0x90 [sr_mod]
[ 367.093333] [<ffff00000859901c>] device_release_driver_internal+0x15c/0x1dc
[ 367.100275] [<ffff0000085990c4>] device_release_driver+0x28/0x34
[ 367.106251] [<ffff00000859794c>] bus_remove_device+0xe4/0x14c
[ 367.111993] [<ffff000008593890>] device_del+0x1dc/0x308
[ 367.117191] [<ffff0000085e121c>] __scsi_remove_device+0xfc/0x128
[ 367.123192] [<ffff0000085de894>] scsi_forget_host+0x74/0x78
[ 367.128778] [<ffff0000085d0cf8>] scsi_remove_host+0x7c/0x120
[ 367.134396] [<ffff000000ad3720>] usb_stor_disconnect+0x58/0xc8 [usb_storage]
[ 367.141421] [<ffff000008636fd8>] usb_unbind_interface+0x7c/0x268
[ 367.147397] [<ffff00000859901c>] device_release_driver_internal+0x15c/0x1dc
[ 367.154355] [<ffff0000085990c4>] device_release_driver+0x28/0x34
[ 367.160344] [<ffff00000859794c>] bus_remove_device+0xe4/0x14c
[ 367.166060] [<ffff000008593890>] device_del+0x1dc/0x308
[ 367.171282] [<ffff0000086344fc>] usb_disable_device+0xc4/0x2ac
[ 367.177084] [<ffff000008629de4>] usb_disconnect+0x98/0x248
[ 367.182566] [<ffff000008629ff8>] hub_quiesce+0x64/0xac
[ 367.187674] [<ffff00000862bfc0>] hub_event+0x110/0xa8c
[ 367.192810] [<ffff0000080f2a64>] process_one_work+0x228/0x6a8
[ 367.198568] [<ffff0000080f2f44>] worker_thread+0x60/0x3bc
[ 367.203922] [<ffff0000080faa40>] kthread+0x114/0x140
[ 367.208898] [<ffff0000080833d0>] ret_from_fork+0x10/0x40
[ 367.214171] INFO: task kworker/15:3:1890 blocked for more than 120 seconds.
[ 367.221108] Tainted: G W 4.13.0-rc3+ #2
[ 367.226563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 367.234388] kworker/15:3 D 0 1890 2 0x00000200
[ 367.239918] Workqueue: events wait_rcu_exp_gp
[ 367.244184] Call trace:
[ 367.246615] [<ffff000008085c9c>] __switch_to+0x64/0x70
[ 367.251751] [<ffff0000088a3680>] __schedule+0x410/0xcbc
[ 367.256945] [<ffff0000088a3f6c>] schedule+0x40/0xa4
[ 367.261820] [<ffff0000088a936c>] schedule_timeout+0x1e8/0x454
[ 367.267536] [<ffff000008165da0>] rcu_exp_wait_wake+0xb8/0x5b0
[ 367.273278] [<ffff00000816a1f8>] wait_rcu_exp_gp+0x3c/0x4c
[ 367.278776] [<ffff0000080f2a64>] process_one_work+0x228/0x6a8
[ 367.284475] [<ffff0000080f2f44>] worker_thread+0x60/0x3bc
[ 367.289931] [<ffff0000080faa40>] kthread+0x114/0x140
[ 367.294793] [<ffff0000080833d0>] ret_from_fork+0x10/0x40
[ 367.300100] INFO: lockdep is turned off.

--
Regards
Pratyush