Re: [PATCH net-next] netdevsim: Register and unregister devlink traps on probe/remove device

From: Leon Romanovsky
Date: Wed Oct 27 2021 - 02:44:02 EST


On Tue, Oct 26, 2021 at 01:03:41PM -0700, Edwin Peer wrote:
> On Tue, Oct 26, 2021 at 12:22 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
>
> > At least in mlx5 case, reload_enable() was before register_netdev().
> > It stayed like this after swapping it with devlink_register().
>
> What am I missing here?
>
> err = mlx5_init_one(dev);
> if (err) {
> mlx5_core_err(dev, "mlx5_init_one failed with error code %d\n", err);
> goto err_init_one;
> }
>
> err = mlx5_crdump_enable(dev);
> if (err)
> dev_err(&pdev->dev, "mlx5_crdump_enable failed with error code %d\n", err);
>
> pci_save_state(pdev);
> devlink_register(devlink);
>
> Doesn't mlx5_init_one() ultimately result in the netdev being
> presented to user space, even if it is via aux bus?

mlx5_init_one() creates aux devices, and the driver for them is not always
loaded directly by the kernel. The device creation triggers a udev event,
which is handled by systemd-udevd. It reads the various modules.* files
that the kernel provides, and this is how it knows which driver to load.

In our case, the eth driver is part of mlx5_core module, so at the
device creation phase that module is already loaded and driver/core
will try to autoprobe it.

However, that last step is not always performed, because it is controlled
by userspace. Users can disable driver autoprobe and bind manually. This
is pretty standard practice in SR-IOV or VFIO setups.

>
> > No, it is not requirement, but my suggestion. You need to be aware that
> > after call to devlink_register(), the device will be fully open for devlink
> > netlink access. So it is strongly advised to put devlink_register to be the
> > last command in PCI initialization sequence.
>
> Right, that's the problem. Once we register the netdev, we're in a
> race with user space, which may expect to be able to call devlink
> before we get to devlink_register().

This is why devlink has a monitor mode, where you can see devlink device
addition and removal. It is user space's job to check that the device is
ready.
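
A rough userspace sketch of such a check, not taken from any existing
tool. It assumes that "devlink monitor dev" prints one line per event in
the form "[dev,new] <handle>"; the helper names is_dev_new_event() and
wait_for_devlink_dev() are made up for illustration, and the prefix may
need adjusting for your iproute2 version.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if `line` reports that devlink device `handle`
 * (e.g. "pci/0000:06:00.0") was registered.
 * Assumption: the line looks like "[dev,new] <handle>", as printed
 * by `devlink monitor dev`.
 */
static bool is_dev_new_event(const char *line, const char *handle)
{
	static const char prefix[] = "[dev,new] ";
	size_t plen = sizeof(prefix) - 1;
	size_t hlen = strlen(handle);

	if (strncmp(line, prefix, plen) != 0)
		return false;
	line += plen;
	if (strncmp(line, handle, hlen) != 0)
		return false;
	/* Reject longer handles that merely share a prefix. */
	return line[hlen] == '\0' || line[hlen] == '\n' ||
	       line[hlen] == ' ' || line[hlen] == ':';
}

/* Read events from `in` until the device appears; false on EOF. */
static bool wait_for_devlink_dev(FILE *in, const char *handle)
{
	char line[512];

	while (fgets(line, sizeof(line), in))
		if (is_dev_new_event(line, handle))
			return true;
	return false;
}
```

On a POSIX system the stream could come from
popen("devlink monitor dev", "r"). The monitor only reports events that
happen while it is running, so a caller would likely also check
"devlink dev show" once after starting it.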

>
> > You obviously need to fix your code. Upstream version of bnxt driver
> > doesn't have reload_* support, so all this regression blaming it not
> > relevant here.
>
> Right, our timing is unfortunate and that's on us. It's still not
> clear to me how to actually fix the devlink reload code without the
> benefit of something similar to the reload enable API.
>
> > In upstream code, devlink_register() doesn't accept ops like it was
> > before and position of that call does only one thing - opens devlink
> > netlink access. All kernel devlink APIs continue to be accessible even
> > before devlink_register.
>
> This isn't about kernel API. This is precisely about existing user
> space that expects devlink to work immediately after the netdev
> appears.

Can you please share an open source project that makes such an assumption?

>
> > It looks like your failure is in backport code.
>
> Our out-of-tree driver isn't the issue here. I'm talking about the
> proposed upstream code. The issue is what to do in order to get
> something workable upstream for devlink reload. We can't move
> devlink_register() later, that will cause a regression. What do you
> suggest instead?

Fix your test to respect devlink notifications instead of ignoring them.

Thanks

>
> Regards,
> Edwin Peer