Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to subdev devices

From: Greg KH
Date: Tue Mar 05 2019 - 02:13:37 EST


On Fri, Mar 01, 2019 at 05:21:13PM +0000, Parav Pandit wrote:
>
>
> > -----Original Message-----
> > From: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> > Sent: Friday, March 1, 2019 1:22 AM
> > To: Parav Pandit <parav@xxxxxxxxxxxx>
> > Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > michal.lkml@xxxxxxxxxxx; davem@xxxxxxxxxxxxx; Jiri Pirko
> > <jiri@xxxxxxxxxxxx>
> > Subject: Re: [RFC net-next 8/8] net/mlx5: Add subdev driver to bind to
> > subdev devices
> >
> > On Thu, Feb 28, 2019 at 11:37:52PM -0600, Parav Pandit wrote:
> > > Add a subdev driver to probe the subdev devices and create fake
> > > netdevice for it.
> >
> > So I'm guessing here is the "meat" of the whole goal here?
> >
> > You just want multiple netdevices per PCI device? Why can't you do that
> > today in your PCI driver?
> >
> Yes, but it is not just multiple netdevices.
> Let me please elaborate in detail.
>
> There is a switchdev mode for the netdevices of a PCI function.
> In this mode, a given netdev has an additional control netdev (called a representor netdevice, or rep-ndev).
> This rep-ndev is attached to OVS for adding rules, offloads, etc. using the standard tc and netfilter infrastructure.
> Currently this rep-ndev controls the switch side of the settings, but not the host side of the netdev.
> So there is discussion about creating another netdev or a devlink port..
>
> Additionally, this subdev has an optional rdma device too.
>
> And when we are in switchdev mode, this rdma device has a similar rdma rep device for control.
>
> In some cases we actually don't create a netdev at all, when the device is in InfiniBand mode.
> Here there is PCI device->rdma_device.
>
> In other cases, a given sub device for rdma is a dual-port device, having a netdevice for each port that can use the existing netdev->dev_port.
>
> Creating 4 devices of two different classes using one iproute2/ip or iproute2/rdma command is a horrible thing to do.
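[Editor's note: for context, the switchdev mode and representor flow described in the quote above are normally driven through the existing iproute2/devlink and tc tooling, roughly as below. The PCI address and interface names are illustrative only, not taken from the patchset, and the commands require supporting hardware:]

```shell
# Put the PCI function's e-switch into switchdev mode
# (pci/0000:05:00.0 is a placeholder address).
devlink dev eswitch set pci/0000:05:00.0 mode switchdev

# Each sub-function/VF then gets a representor netdev on the host
# (here assumed to be eth0_0); offload rules are attached to the
# representor with standard tc flower classifiers.
tc qdisc add dev eth0_0 ingress
tc filter add dev eth0_0 ingress protocol ip flower skip_sw \
    dst_ip 10.0.0.2 action mirred egress redirect dev eth0_1
```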

Why is that?

> If this sub device has to be a passthrough device, the ip link command will fail badly, because we would be creating some sub device which is not even a netdevice.

But it is a network device, right?

> So iproute2/devlink, which works on bus+device (mainly PCI today), seems like the right abstraction point to create sub devices.
> This also extends to the rich infrastructure already built there: mapping ports of the device, health reporting, register debugging, etc.
>
> Additionally, we don't want the mlx driver and other drivers to have to walk their child devices (with the logic split between netdev and rdma) for power management.

And how is power management going to work with your new devices? All
you have here is a tiny shim around a driver bus; I do not see any new
functionality, and, as others have said, no way to actually share, or
split up, the PCI resources.

> The kernel core code does that well today, and we would like to leverage it through subdev bus or mfd pm callbacks.
>
> So it is a lot more than just creating netdevices.

But that's all you are showing here :)

> > What problem are you trying to solve that others also are having that
> > requires all of this?
> >
> > Adding a new bus type and subsystem is fine, but usually we want more
> > than just one user of it, as this does not really show how it is exercised very
> > well.
> This subdev and devlink infrastructure solves the problem of creating smaller sub devices out of one PCI device.
> Someone has to start.. :-)

That is what a mfd should allow you to do.

> To my knowledge, Netronome, Broadcom and Mellanox are actively using this devlink and switchdev infrastructure today.

Where are they "using it"? This patchset does not show that.

> > Ideally 3 users would be there as that is when it proves itself that it is
> > flexible enough.
> >
>
> We were looking at drivers/visorbus to see if we could repurpose it, but its GUID device naming scheme is just not user friendly.

You can always change the naming scheme if needed. But why isn't a GUID
ok? It's very easy to reserve properly, and you do not need a central
naming "authority".

> > Would just using the mfd subsystem work better for you? That provides
> > core support for "multi-function" drivers/devices already. What is missing
> > from that subsystem that does not work for you here?
> >
> We were not aware of mfd until now. I have only looked at it at a very high level so far. It's a wrapper around platform devices and seems widely used.
> Before the subdev proposal, Jason suggested an alternative: create platform devices and have a driver attach to them.
>
> When I read the kernel documentation [1], it says "platform devices typically appear as autonomous entities".
> Here, instead of being autonomous, they are under the user's control.
> Platform devices probably don't disappear a lot in a live system, as opposed to subdevices, which are created and removed dynamically far more often.
>
> Not sure whether using a platform device would be an abuse for this purpose or not.

No, do not abuse a platform device. You should be able to just use a
normal PCI device for this just fine, and if not, we should be able to
make the needed changes to mfd for that.
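[Editor's note: to illustrate the mfd route suggested here, a parent PCI driver can carve its function into sub-devices by registering mfd cells, which the driver core then probes, removes, and power-manages like any other devices. This is a hedged sketch of that pattern; the "demo-subdev-*" cell names and the driver itself are hypothetical, not part of the mlx5 patchset:]

```c
/* Hypothetical sketch: the parent PCI driver registers one mfd cell
 * per sub-function, and separate drivers bind to the cells by name.
 * All names below are illustrative.
 */
#include <linux/mfd/core.h>
#include <linux/pci.h>

static const struct mfd_cell demo_subdev_cells[] = {
	{ .name = "demo-subdev-netdev" },	/* bound by a netdev driver */
	{ .name = "demo-subdev-rdma" },		/* bound by an rdma driver  */
};

static int demo_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
	int ret;

	ret = pci_enable_device(pdev);
	if (ret)
		return ret;

	/* Create the child devices; the driver core then dispatches
	 * probe/remove and PM callbacks to each cell's driver, so the
	 * parent need not walk its children by hand. */
	return mfd_add_devices(&pdev->dev, PLATFORM_DEVID_AUTO,
			       demo_subdev_cells,
			       ARRAY_SIZE(demo_subdev_cells),
			       NULL, 0, NULL);
}

static void demo_remove(struct pci_dev *pdev)
{
	mfd_remove_devices(&pdev->dev);
	pci_disable_device(pdev);
}
```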

thanks,

greg k-h