hung task in 4.14 (syzbot bug from 2018 April 17)

From: Sebastian Kuzminsky
Date: Wed Oct 03 2018 - 19:29:38 EST


I think i've run into the bug described here:

https://lkml.org/lkml/2018/4/18/188

I've got a binder-free system that reports a hung task with this backtrace:

[76800.726654] INFO: task systemd:1 blocked for more than 60 seconds.
[76800.726657] Tainted: G OE 4.14.67-solidfire1 #1
[76800.726657] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[76800.726659] systemd D 0 1 0 0x00000000
[76800.726662] Call Trace:
[76800.726673] ? __schedule+0x27f/0x870
[76800.726676] schedule+0x28/0x80
[76800.726679] schedule_timeout+0x1e7/0x340
[76800.726685] ? check_preempt_wakeup+0x102/0x230
[76800.726687] ? wait_for_completion+0xb0/0x120
[76800.726689] wait_for_completion+0xb0/0x120
[76800.726693] ? wake_up_q+0x70/0x70
[76800.726698] flush_work+0x10d/0x1c0
[76800.726700] ? worker_detach_from_pool+0xa0/0xa0
[76800.726706] fsnotify_destroy_group+0x34/0xa0
[76800.726708] ? SyS_epoll_ctl+0x1d4/0xe50
[76800.726710] inotify_release+0x1a/0x50
[76800.726714] __fput+0xd8/0x220
[76800.726717] task_work_run+0x8a/0xb0
[76800.726721] exit_to_usermode_loop+0xb9/0xc0
[76800.726723] do_syscall_64+0x10b/0x120
[76800.726727] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[76800.726730] RIP: 0033:0x7fb6957ff900
[76800.726731] RSP: 002b:00007ffc685fdd60 EFLAGS: 00000293 ORIG_RAX:
0000000000000003
[76800.726733] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 00007fb6957ff900
[76800.726735] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000012
[76800.726736] RBP: 00007fb697167088 R08: 000055ae6c9224c0 R09: 000055ae6ace92ad
[76800.726737] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[76800.726738] R13: 0000000000000000 R14: 0000000000079de4 R15: 0000000000000000
[76800.727130] INFO: task kworker/u113:1:29214 blocked for more than 60 seconds.
[76800.727132] Tainted: G OE 4.14.67-solidfire1 #1
[76800.727132] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[76800.727134] kworker/u113:1 D 0 29214 2 0x80000000
[76800.727139] Workqueue: events_unbound fsnotify_mark_destroy_workfn
[76800.727141] Call Trace:
[76800.727144] ? __schedule+0x27f/0x870
[76800.727146] schedule+0x28/0x80
[76800.727148] schedule_timeout+0x1e7/0x340
[76800.727151] ? __switch_to_asm+0x40/0x70
[76800.727153] ? update_curr+0xe1/0x1a0
[76800.727156] ? wait_for_completion+0xb0/0x120
[76800.727157] wait_for_completion+0xb0/0x120
[76800.727160] ? wake_up_q+0x70/0x70
[76800.727164] __synchronize_srcu.part.13+0x76/0x90
[76800.727167] ? trace_raw_output_rcu_utilization+0x40/0x40
[76800.727169] ? try_to_wake_up+0x44/0x460
[76800.727172] ? fsnotify_mark_destroy_workfn+0x67/0xb0
[76800.727174] fsnotify_mark_destroy_workfn+0x67/0xb0
[76800.727177] process_one_work+0x1da/0x3d0
[76800.727179] worker_thread+0x21f/0x3f0
[76800.727181] ? process_one_work+0x3d0/0x3d0
[76800.727184] kthread+0x119/0x130
[76800.727186] ? kthread_create_on_node+0x40/0x40
[76800.727188] ret_from_fork+0x35/0x40

The kernel is a stock 4.14.67, plus some minor local patches mostly
related to fibre channel, which i believe is not implicated here.

I have a crash dump of this failure, the reaper_work struct has these contents:

crash> print reaper_work
$2 = {
work = {
data = {
counter = -108686497013755
},
entry = {
next = 0xffff9d2671a395f0,
prev = 0xffffb2624006fdf8
},
func = 0xffffffffb5249df0
},
timer = {
entry = {
next = 0xdead000000000200,
pprev = 0x0
},
expires = 4302608557,
function = 0xffffffffb50778c0,
data = 18446744072468618080,
flags = 195035137
},
wq = 0xffff9d2680411400,
cpu = 128
}

I'd appreciate help or pointers on how to debug and fix this problem.


--
Sebastian Kuzminsky