Re: net: hang in unregister_netdevice: waiting for lo to become free

From: Dmitry Vyukov
Date: Thu Apr 12 2018 - 08:15:58 EST


On Wed, Feb 21, 2018 at 3:53 PM, Tommi Rantala
<tommi.t.rantala@xxxxxxxxx> wrote:
> On 20.02.2018 18:26, Neil Horman wrote:
>>
>> On Tue, Feb 20, 2018 at 09:14:41AM +0100, Dmitry Vyukov wrote:
>>>
>>> On Tue, Feb 20, 2018 at 8:56 AM, Tommi Rantala
>>> <tommi.t.rantala@xxxxxxxxx> wrote:
>>>>
>>>> On 19.02.2018 20:59, Dmitry Vyukov wrote:
>>>>>
>>>>> Is this meant to be fixed already? I am still seeing this on the
>>>>> latest upstream tree.
>>>>>
>>>>
>>>> These two commits are in v4.16-rc1:
>>>>
>>>> commit 4a31a6b19f9ddf498c81f5c9b089742b7472a6f8
>>>> Author: Tommi Rantala <tommi.t.rantala@xxxxxxxxx>
>>>> Date: Mon Feb 5 21:48:14 2018 +0200
>>>>
>>>> sctp: fix dst refcnt leak in sctp_v4_get_dst
>>>> ...
>>>> Fixes: 410f03831 ("sctp: add routing output fallback")
>>>> Fixes: 0ca50d12f ("sctp: fix src address selection if using
>>>> secondary
>>>> addresses")
>>>>
>>>>
>>>> commit 957d761cf91cdbb175ad7d8f5472336a4d54dbf2
>>>> Author: Alexey Kodanev <alexey.kodanev@xxxxxxxxxx>
>>>> Date: Mon Feb 5 15:10:35 2018 +0300
>>>>
>>>> sctp: fix dst refcnt leak in sctp_v6_get_dst()
>>>> ...
>>>> Fixes: dbc2b5e9a09e ("sctp: fix src address selection if using
>>>> secondary
>>>> addresses for ipv6")
>>>>
>>>>
>>>> I guess we missed something if it's still reproducible.
>>>>
>>>> I can check it later this week, unless someone else beat me to it.
>>>
>>>
>>> Hi Tommi,
>>>
>>> Hmmm, I can't claim that it's exactly the same bug. Perhaps it's
>>> another one then. But I am still seeing these:
>>>
>>> [ 58.799130] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [ 60.847138] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [ 62.895093] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>> [ 64.943103] unregister_netdevice: waiting for lo to become free. Usage count = 4
>>>
>>> on upstream tree pulled ~12 hours ago.
>>>
>> Can you write a systemtap script to probe dev_hold, and dev_put, printing
>> out a
>> backtrace if the device name matches "lo". That should tell us
>> definitively if
>> the problem is in the same location or not
>
>
> Hi Dmitry, I tested with the reproducer and the kernel .config file that you
> sent in the first email in this thread:
>
> With 4.16-rc2 unable to reproduce.
>
> With 4.15-rc9 bug reproducible, and I get "unregister_netdevice: waiting for
> lo to become free. Usage count = 3"
>
> With 4.15-rc9 and Alexey's "sctp: fix dst refcnt leak in sctp_v6_get_dst()"
> cherry-picked on top, unable to reproduce.
>
>
> Is syzkaller doing something else now to trigger the bug...?
> Can you still trigger the bug with the same reproducer?

Hi Neil, Tommi,

Reviving this old thread about "unregister_netdevice: waiting for lo
to become free. Usage count = 3" hangs.
I still have not had time to dig into what happens there (too
many bugs coming from syzbot). But this still actively happens, and I
suspect it accounts for a significant portion of the various hang
reports, which are quite unpleasant.

One idea that could make it all simpler:

Is this wait loop in netdev_wait_allrefs() supposed to wait for any
prolonged periods of time under any non-buggy conditions? E.g. more
than 1-2 minutes?
If it is only supposed to wait briefly for things that are already
supposed to be shutting down, and we add a WARNING there after some
timeout, then syzbot will report all the info on how/when it happens,
hopefully extract reproducers, and all the other nice things.
But this WARNING should not have any false positives under any
realistic conditions (e.g. waiting for arrival of remote packets with
large timeouts).
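
To make the idea concrete, here is a minimal userspace sketch of the
"warn once after a timeout" logic. It is not kernel code and not a
patch: struct wait_state, should_warn(), simulate_wait() and the
120-second WARN_AFTER_SECS threshold are all made-up names and values
for illustration. In the kernel, the equivalent check would presumably
sit inside the loop in netdev_wait_allrefs() and use something like
WARN_ONCE(), but I have not tried that.

```c
#include <stdbool.h>

/* Hypothetical threshold, picked from the "1-2 minutes" guess above. */
#define WARN_AFTER_SECS 120

/* Hypothetical bookkeeping for one wait on one device. */
struct wait_state {
	long start_secs;	/* when the wait began */
	bool warned;		/* has the one-shot warning fired yet? */
};

/*
 * Returns true exactly once: the first time the wait has lasted at
 * least WARN_AFTER_SECS while references are still outstanding.
 */
bool should_warn(struct wait_state *ws, long now_secs, int refcnt)
{
	if (refcnt == 0 || ws->warned)
		return false;
	if (now_secs - ws->start_secs < WARN_AFTER_SECS)
		return false;
	ws->warned = true;
	return true;
}

/*
 * Models the wait loop polling once per second. The single outstanding
 * reference is dropped at release_at (never, if release_at < 0), and
 * the simulation stops at give_up_at. Returns how many warnings fired.
 */
int simulate_wait(long release_at, long give_up_at)
{
	struct wait_state ws = { .start_secs = 0, .warned = false };
	int warnings = 0;
	long now;

	for (now = 0; now < give_up_at; now++) {
		int refcnt = (release_at >= 0 && now >= release_at) ? 0 : 1;

		if (refcnt == 0)
			break;	/* device became free, wait is over */
		if (should_warn(&ws, now, refcnt))
			warnings++;
	}
	return warnings;
}
```

In this model a reference that is dropped before the threshold never
triggers the warning, and a leaked reference triggers it only once, so
a report would point at a real leak rather than at a slow but
legitimate teardown.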

Looking at some task hung reports, it seems that this code holds some
mutexes, occupies a workqueue thread and prevents any progress with
destruction of other devices (and net namespace creation/destruction),
so I guess it should not wait for indefinite periods of time?