Re: ip_auto_config() prevents network device to be registered

From: Javier Martinez Canillas
Date: Fri Mar 17 2017 - 13:26:40 EST


Hello,

On 01/31/2017 02:49 PM, Javier Martinez Canillas wrote:
>
> The kernelci folks pointed out that a Samsung Exynos based board was failing
> to boot when trying to mount the rootfs via NFS, due to a networking issue [0].
>
> I looked at the issue and it turned out to be a race between ip_auto_config()
> and register_netdev() when using the ip=dhcp param in the kernel command line.
>
> The problem is that ip_auto_config() calls wait_for_devices() [1] and returns
> as soon as it finds a network device registered. Then ic_open_devs() [2] is
> called to bring the network devs up and wait for their carrier signals.
>
> But ic_open_devs() grabs the rtnl_mutex lock [3] when doing this, which is the
> same lock that register_netdev() [4] grabs before registering a network device.
>
> And so if a network dev is found and wait_for_devices() returns, ic_open_devs()
> will be called and no new network dev can be registered in the meantime.
>
> So since ic_open_devs() waits up to CONF_CARRIER_TIMEOUT (120 secs) with this
> lock held, if the network dev that's supposed to get its IP over DHCP isn't the
> first to be registered, the boot test job may time out and be considered a failure.
>
> A workaround is to use ip=:::::eth0:dhcp instead of ip=dhcp, so wait_for_devices()
> waits for this specific device. Another workaround is to increase the timeout
> for the job to be much bigger than CONF_CARRIER_TIMEOUT so ip_auto_config() can
> retry and the network devices can be registered between tries.
>
> But I wonder if someone can suggest a proper way to fix this. Grabbing a mutex
> that prevents network devs from being registered for 120 secs doesn't sound correct.
>
> Thanks a lot for your help and please let me know if I misunderstood something.
>
> [0]: https://storage.kernelci.org/mainline/v4.9/arm-exynos_defconfig/lab-collabora/boot-exynos5422-odroidxu3_rootfs:nfs.html
> [1]: http://lxr.free-electrons.com/source/net/ipv4/ipconfig.c#L1368
> [2]: http://lxr.free-electrons.com/source/net/ipv4/ipconfig.c#L202
> [3]: http://lxr.free-electrons.com/source/net/core/rtnetlink.c#L68
> [4]: http://lxr.free-electrons.com/source/net/core/dev.c#L7326
>
>

Any comments on this?

We are still seeing this problem with today's -next (20170310):

https://storage.kernelci.org/next/next-20170310/arm-exynos_defconfig/lab-collabora/boot-exynos5422-odroidxu3.html
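
For illustration, the ordering described in the quoted report can be modelled
in user space. This is only a sketch, not kernel code: a Python lock stands in
for rtnl_mutex, the 120 sec carrier wait is scaled down to a short sleep, and
the device name "usb0" for the first-registered device is a made-up
placeholder (eth0 is taken from the workaround above).

```python
import threading
import time

rtnl_mutex = threading.Lock()   # stand-in for the kernel's rtnl_mutex
events = []                     # ordered log, only appended to with the lock held


def register_netdev(name):
    """Model of register_netdev(): take the lock, then register the device."""
    with rtnl_mutex:
        events.append("register:" + name)


def ic_open_devs(carrier_timeout, lock_taken):
    """Model of ic_open_devs(): wait for carrier with the lock held."""
    with rtnl_mutex:
        events.append("open_devs:start")
        lock_taken.set()              # the late driver may start probing now
        time.sleep(carrier_timeout)   # carrier wait (assumed: no carrier appears)
        events.append("open_devs:end")


def simulate(carrier_timeout=0.05):
    """Return the event order when the DHCP device is not the first to register."""
    events.clear()
    lock_taken = threading.Event()

    def late_driver():
        lock_taken.wait()             # probe only once ic_open_devs() holds the lock
        register_netdev("eth0")

    register_netdev("usb0")           # first device shows up, wait_for_devices() returns
    driver = threading.Thread(target=late_driver)
    opener = threading.Thread(target=ic_open_devs,
                              args=(carrier_timeout, lock_taken))
    driver.start()
    opener.start()
    driver.join()
    opener.join()
    return list(events)


if __name__ == "__main__":
    print(simulate())
```

In this model the late device can only be registered after the carrier wait
has finished, which matches the behaviour seen in the boot logs.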

Best regards,
--
Javier Martinez Canillas
Open Source Group
Samsung Research America