Re: INFO: task hung in ip6gre_exit_batch_net

From: Dmitry Vyukov
Date: Thu Jun 07 2018 - 15:04:01 EST


On Thu, Jun 7, 2018 at 8:54 PM, Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> wrote:
>>>>> Hi, Dmitry!
>>>>>
>>>>> On 04.06.2018 18:22, Dmitry Vyukov wrote:
>>>>>> On Mon, Jun 4, 2018 at 5:03 PM, syzbot
>>>>>> <syzbot+bf78a74f82c1cf19069e@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> syzbot found the following crash on:
>>>>>>>
>>>>>>> HEAD commit: bc2dbc5420e8 Merge branch 'akpm' (patches from Andrew)
>>>>>>> git tree: upstream
>>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=164e42b7800000
>>>>>>> kernel config: https://syzkaller.appspot.com/x/.config?x=982e2df1b9e60b02
>>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=bf78a74f82c1cf19069e
>>>>>>> compiler: gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>>>
>>>>>>> Unfortunately, I don't have any reproducer for this crash yet.
>>>>>>>
>>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>>> Reported-by: syzbot+bf78a74f82c1cf19069e@xxxxxxxxxxxxxxxxxxxxxxxxx
>>>>>>
>>>>>> Another hang on rtnl lock:
>>>>>>
>>>>>> #syz dup: INFO: task hung in netdev_run_todo
>>>>>>
>>>>>> May be related to "unregister_netdevice: waiting for DEV to become free":
>>>>>> https://syzkaller.appspot.com/bug?id=1a97a5bd119fd97995f752819fd87840ab9479a9
>>>>
>>>> netdev_wait_allrefs() does not hold the rtnl lock while it waits, so it
>>>> must be something different.
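
For reference, that wait loop takes rtnl only briefly, to rebroadcast
NETDEV_UNREGISTER, and sleeps with it released. A condensed sketch from
memory of net/core/dev.c of that era, not a verbatim copy:

    /* Condensed sketch of netdev_wait_allrefs() (net/core/dev.c, ~v4.17);
     * paraphrased from memory, not verbatim. */
    static void netdev_wait_allrefs(struct net_device *dev)
    {
            unsigned long rebroadcast_time = jiffies;

            while (netdev_refcnt_read(dev) != 0) {
                    if (time_after(jiffies, rebroadcast_time + 1 * HZ)) {
                            /* rtnl is held only for this rebroadcast... */
                            rtnl_lock();
                            call_netdevice_notifiers(NETDEV_UNREGISTER, dev);
                            __rtnl_unlock();
                            rebroadcast_time = jiffies;
                    }
                    /* ...and the actual waiting happens with rtnl released */
                    msleep(250);
            }
    }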
>>>>
>>>>
>>>>>> Any other explanations for massive hangs on rtnl lock for minutes?
>>>>>
>>>>> To exclude the situation where a task exits with rtnl_mutex held:
>>>>>
>>>>> would the pr_warn() output from print_held_locks_bug() appear in the
>>>>> console log, if it is triggered?
>>>>
>>>> Yes, anything containing "WARNING:" is detected as a bug.
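
And the report that function prints does start with such a line. A sketch
of print_held_locks_bug() from kernel/locking/lockdep.c, written from
memory, not verbatim:

    /* Sketch of print_held_locks_bug() (kernel/locking/lockdep.c); from
     * memory, not verbatim. The first meaningful line it emits matches
     * the "WARNING:" pattern that syzkaller keys on. */
    static void print_held_locks_bug(void)
    {
            if (!debug_locks_off() || debug_locks_silent)
                    return;

            pr_warn("====================================\n");
            pr_warn("WARNING: %s/%d still has locks held!\n",
                    current->comm, task_pid_nr(current));
            pr_warn("------------------------------------\n");
            lockdep_print_held_locks(current);
    }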
>>>
>>> OK, then a dead task not releasing the lock is excluded.
>>>
>>> One more hypothesis: someone corrupted memory around rtnl_mutex, so it merely
>>> looks locked. (I read the lockdep "(rtnl_mutex){+.+.}" prints in the initial
>>> message as "nobody owns rtnl_mutex".) A crash dump of the VM could help here.
>>
>> I can't immediately find a legend for these +'s and .'s, but {+.+.} is
>> present in large amounts in just about any task-hung report, for all kinds
>> of mutexes, so I would not expect it to mean corruption.
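
If I remember right, the legend is in Documentation/locking/lockdep-design.txt;
roughly (treat the details as approximate):

    The four characters are usage bits, in the order
    {HARDIRQ, HARDIRQ-READ, SOFTIRQ, SOFTIRQ-READ}:

        '.'  acquired with irqs disabled, or never in that state
        '+'  acquired with that irq type enabled
        '-'  acquired in that irq context itself
        '?'  both '+' and '-'

    So {+.+.} only records that the mutex has ever been taken in process
    context with hardirqs and softirqs enabled; it says nothing about
    whether anyone holds it now.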
>>
>> There are dozens of known corruptions that syzkaller can trigger, but
>> they are usually caught reliably by KASAN. If any of them led to
>> silent memory corruption, we would get dozens of assorted crashes
>> throughout the kernel. We've seen that at some points in the past, but
>> not recently. So I would assume that memory is not corrupted in all
>> these cases:
>> https://syzkaller.appspot.com/bug?id=2503c576cabb08d41812e732b390141f01a59545
>
> This BUG report clarifies the {+.+.}:
>
> 4 locks held by kworker/0:145/381:
> #0: ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] work_static include/linux/workqueue.h:198 [inline]
> #0: ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_data kernel/workqueue.c:619 [inline]
> #0: ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline]
> #0: ((wq_completion)"hwsim_wq"){+.+.}, at: [<000000003f9487f0>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084
> #1: ((work_completion)(&data->destroy_work)){+.+.}, at: [<00000000bbdd2115>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088
> #2: (rtnl_mutex){+.+.}, at: [<000000009c9d14f8>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74
> #3: (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] exp_funnel_lock kernel/rcu/tree_exp.h:272 [inline]
> #3: (rcu_sched_state.exp_mutex){+.+.}, at: [<000000001ba1a807>] _synchronize_rcu_expedited.constprop.72+0x9fa/0xac0 kernel/rcu/tree_exp.h:596
>
> There we have rtnl_mutex locked, and its {+.+.} annotation is the same as
> above. It's definitely held, since one more lock is taken after it.
>
> This BUG happens because there are many rtnl_mutex waiters while the owner
> is synchronizing RCU. That case is rather clear to me, in contrast to this
> topic's hang.


You mean that it's not hung, but rather needs more than 2 minutes to
complete, right?
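
If so, the lock stack above implies roughly the following shape. This is a
hypothetical sketch of the pattern, with a made-up function name, not a
quote of the actual mac80211_hwsim code:

    /* Hypothetical sketch: the worker holds rtnl across an expedited RCU
     * grace period, so every other rtnl_lock() caller has to wait for
     * that grace period too. Under heavy load, even expedited grace
     * periods can take a long time. */
    static void destroy_work_fn(struct work_struct *work)
    {
            rtnl_lock();
            synchronize_rcu_expedited();    /* e.g. via device teardown */
            rtnl_unlock();
    }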


>> I wonder if it can be just that slow, but not actually hung... net
>> namespace destruction is super slow, so perhaps under heavy load it
>> all stalls for minutes...
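
For context on that slowness: all namespaces queued for destruction are
torn down by one global work item, and each subsystem's exit hooks
typically need rtnl and/or RCU grace periods. A condensed sketch of
cleanup_net() from net/core/net_namespace.c of that era, from memory and
with locking elided, not a verbatim copy:

    /* Condensed sketch of cleanup_net() (net/core/net_namespace.c,
     * ~v4.17); from memory, locking elided, not verbatim. */
    static void cleanup_net(struct work_struct *work)
    {
            const struct pernet_operations *ops;
            LIST_HEAD(net_exit_list);

            /* grab everything queued for destruction into one batch */
            list_replace_init(&cleanup_list, &net_exit_list);

            synchronize_rcu();      /* one grace period for the batch */

            /* run each subsystem's ->exit()/->exit_batch() hooks; many
             * of them (ip6gre_exit_batch_net included) take rtnl here */
            list_for_each_entry_reverse(ops, &pernet_list, list)
                    ops_exit_list(ops, &net_exit_list);
    }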