Re: possible deadlock in flush_work (2)
From: Dmitry Vyukov
Date: Mon Nov 06 2017 - 05:34:57 EST
On Sun, Nov 5, 2017 at 5:00 PM, Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
>
> On Sun, 2017-11-05 at 11:53 +0300, Dmitry Vyukov wrote:
>> On Sun, Nov 5, 2017 at 11:41 AM, syzbot
>> <bot+d8fe95298ef830cd7d05e33eefa4a5a6f6f334d4@xxxxxxxxxxxxxxxxxxxxxxx
>> om>
>> wrote:
>> > Hello,
>> >
>> > syzkaller hit the following crash on
>> > 0f611fb6dcc0d6d91b4e1fec911321f434a3b858
>> > git://git.cmpxchg.org/linux-mmots.git/master
>> > compiler: gcc (GCC) 7.1.1 20170620
>> > .config is attached
>> > Raw console output is attached.
>> >
>> > xs_tcp_setup_socket: connect returned unhandled error -113
>> > xs_tcp_setup_socket: connect returned unhandled error -113
>> > xs_tcp_setup_socket: connect returned unhandled error -113
>> >
>> > ======================================================
>> > WARNING: possible circular locking dependency detected
>> > 4.14.0-rc5-mm1+ #20 Not tainted
>> > ------------------------------------------------------
>> > kworker/0:3/3400 is trying to acquire lock:
>> > ("xprtiod"){+.+.}, at: [<ffffffff8146adda>] start_flush_work
>> > kernel/workqueue.c:2850 [inline]
>> > ("xprtiod"){+.+.}, at: [<ffffffff8146adda>] flush_work+0x55a/0x8a0
>> > kernel/workqueue.c:2882
>> >
>> > but task is already holding lock:
>> > ((&task->u.tk_work)){+.+.}, at: [<ffffffff81471eb2>]
>> > process_one_work+0xb32/0x1bc0 kernel/workqueue.c:2087
>> >
>> > which lock already depends on the new lock.
>> >
>> >
>> > the existing dependency chain (in reverse order) is:
>> >
>> > -> #1 ((&task->u.tk_work)){+.+.}:
>> > process_one_work+0xba2/0x1bc0 kernel/workqueue.c:2088
>> > worker_thread+0x223/0x1990 kernel/workqueue.c:2246
>> > kthread+0x38b/0x470 kernel/kthread.c:242
>> > ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
>> >
>> > -> #0 ("xprtiod"){+.+.}:
>> > lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3991
>> > start_flush_work kernel/workqueue.c:2851 [inline]
>> > flush_work+0x57f/0x8a0 kernel/workqueue.c:2882
>> > __cancel_work_timer+0x30a/0x7e0 kernel/workqueue.c:2954
>> > cancel_work_sync+0x17/0x20 kernel/workqueue.c:2990
>> > xprt_destroy+0xa1/0x130 net/sunrpc/xprt.c:1467
>> > xprt_destroy_kref net/sunrpc/xprt.c:1477 [inline]
>> > kref_put include/linux/kref.h:70 [inline]
>> > xprt_put+0x38/0x40 net/sunrpc/xprt.c:1501
>> > rpc_task_release_client+0x299/0x430 net/sunrpc/clnt.c:986
>> > rpc_release_resources_task+0x7f/0xa0 net/sunrpc/sched.c:1020
>> > rpc_release_task net/sunrpc/sched.c:1059 [inline]
>> > __rpc_execute+0x4d9/0xe70 net/sunrpc/sched.c:824
>> > rpc_async_schedule+0x16/0x20 net/sunrpc/sched.c:848
>> > process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2112
>> > worker_thread+0x223/0x1990 kernel/workqueue.c:2246
>> > kthread+0x38b/0x470 kernel/kthread.c:242
>> > ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
>> >
>> > other info that might help us debug this:
>> >
>> > Possible unsafe locking scenario:
>> >
>> > CPU0 CPU1
>> > ---- ----
>> > lock((&task->u.tk_work));
>> > lock("xprtiod");
>> > lock((&task->u.tk_work));
>> > lock("xprtiod");
>> >
>> > *** DEADLOCK ***
>> >
>> > 2 locks held by kworker/0:3/3400:
>> > #0: ("rpciod"){+.+.}, at: [<ffffffff81471e5f>] __write_once_size
>> > include/linux/compiler.h:305 [inline]
>> > #0: ("rpciod"){+.+.}, at: [<ffffffff81471e5f>] atomic64_set
>> > arch/x86/include/asm/atomic64_64.h:33 [inline]
>> > #0: ("rpciod"){+.+.}, at: [<ffffffff81471e5f>] atomic_long_set
>> > include/asm-generic/atomic-long.h:56 [inline]
>> > #0: ("rpciod"){+.+.}, at: [<ffffffff81471e5f>] set_work_data
>> > kernel/workqueue.c:618 [inline]
>> > #0: ("rpciod"){+.+.}, at: [<ffffffff81471e5f>]
>> > set_work_pool_and_clear_pending kernel/workqueue.c:645 [inline]
>> > #0: ("rpciod"){+.+.}, at: [<ffffffff81471e5f>]
>> > process_one_work+0xadf/0x1bc0 kernel/workqueue.c:2083
>> > #1: ((&task->u.tk_work)){+.+.}, at: [<ffffffff81471eb2>]
>> > process_one_work+0xb32/0x1bc0 kernel/workqueue.c:2087
>> >
>> > stack backtrace:
>> > CPU: 0 PID: 3400 Comm: kworker/0:3 Not tainted 4.14.0-rc5-mm1+ #20
>> > Hardware name: Google Google Compute Engine/Google Compute Engine,
>> > BIOS
>> > Google 01/01/2011
>> > Workqueue: rpciod rpc_async_schedule
>> > Call Trace:
>> > __dump_stack lib/dump_stack.c:16 [inline]
>> > dump_stack+0x194/0x257 lib/dump_stack.c:52
>> > print_circular_bug.isra.41+0x342/0x36a
>> > kernel/locking/lockdep.c:1258
>> > check_prev_add kernel/locking/lockdep.c:1901 [inline]
>> > check_prevs_add kernel/locking/lockdep.c:2018 [inline]
>> > validate_chain kernel/locking/lockdep.c:2460 [inline]
>> > __lock_acquire+0x2f55/0x3d50 kernel/locking/lockdep.c:3487
>> > lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3991
>> > start_flush_work kernel/workqueue.c:2851 [inline]
>> > flush_work+0x57f/0x8a0 kernel/workqueue.c:2882
>> > __cancel_work_timer+0x30a/0x7e0 kernel/workqueue.c:2954
>> > cancel_work_sync+0x17/0x20 kernel/workqueue.c:2990
>> > xprt_destroy+0xa1/0x130 net/sunrpc/xprt.c:1467
>> > xprt_destroy_kref net/sunrpc/xprt.c:1477 [inline]
>> > kref_put include/linux/kref.h:70 [inline]
>> > xprt_put+0x38/0x40 net/sunrpc/xprt.c:1501
>> > rpc_task_release_client+0x299/0x430 net/sunrpc/clnt.c:986
>> > rpc_release_resources_task+0x7f/0xa0 net/sunrpc/sched.c:1020
>> > rpc_release_task net/sunrpc/sched.c:1059 [inline]
>> > __rpc_execute+0x4d9/0xe70 net/sunrpc/sched.c:824
>> > rpc_async_schedule+0x16/0x20 net/sunrpc/sched.c:848
>> > process_one_work+0xbf0/0x1bc0 kernel/workqueue.c:2112
>> > worker_thread+0x223/0x1990 kernel/workqueue.c:2246
>> > kthread+0x38b/0x470 kernel/kthread.c:242
>> > ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
>>
>>
>> +sunrpc maintainers
>
> A fix for this has already been merged. Please retest with an up to
> date kernel.
Hi,
What's the fix? Please specify it in the following form:
> syzbot will keep track of this bug report.
> Once a fix for this bug is committed, please reply to this email with:
> #syz fix: exact-commit-title
> Note: all commands must start from beginning of the line.
The bot tests HEAD of multiple branches and it needs to know what
fixes what, because this still happens on other branches. Once the bot
knows the fix, it will track when the fix reaches all tested branches.
Thanks