RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension

From: Parav Pandit
Date: Sun Mar 03 2019 - 23:41:16 EST

> -----Original Message-----
> From: Jakub Kicinski <jakub.kicinski@xxxxxxxxxxxxx>
> Sent: Friday, March 1, 2019 2:04 PM
> To: Parav Pandit <parav@xxxxxxxxxxxx>; Or Gerlitz <gerlitz.or@xxxxxxxxx>
> Cc: netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> michal.lkml@xxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxxxx>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
> On Thu, 28 Feb 2019 23:37:44 -0600, Parav Pandit wrote:
> > Requirements for above use cases:
> > --------------------------------
> > 1. We need a generic user interface & core APIs to create sub devices
> >    from a parent pci device but should be generic enough for other
> >    parent devices
> > 2. Interface should be vendor agnostic
> > 3. User should be able to set device params at creation time
> > 4. In future if needed, tool should be able to create passthrough
> >    device to map to a virtual machine
> Like a mediated device?

> Devices-Better-Userland-IO.pdf
> Other than pass-through it is entirely unclear to me why you'd need a bus.
> (Or should I say VM pass through or DPDK?) Could you clarify why the need
> for a bus?
A bus follows the standard Linux kernel device driver model for attaching a driver to a specific device.
With my limited understanding, based on the documentation [1], a platform device looks like a hack/abuse for this purpose, but it could possibly be an alternative to a bus if that looks fine to Greg and others.

> My thinking is that we should allow spawning subports in devlink and if user
> specifies "passthrough" the device spawned would be an mdev.
A devlink device is a much more comprehensive way to create sub-devices than sub-ports, for at least the reasons below.

1. devlink already defines the device->port relation, which enables creating a multiport device; a subport breaks that.
2. With the bus model, we can in future load a driver from the same vendor, or a generic one such as vfio.
3. Devices live on a bus; mapping a subport to a 'struct device' is not intuitive.
4. A sub-device can reuse the existing devlink port, registers, and health infrastructure, which would otherwise have to be duplicated for ports.
5. Even though current devlink devices are networking devices, nothing restricts them to that; a subport is a restricted view.
6. A devlink device already covers the port sub-object, hence creating a devlink device is desired.

> > 5. A device can have multiple ports
> What does this mean, in practice? You want to spawn a subdev which can
> access both ports? That'd be for RDMA use cases, more than Ethernet,
> right? (Just clarifying :))
Yep, you got it right. :-)

> > So how is it done?
> > ------------------
> > (a) user in control
> > To address above requirements, a generic tool iproute2/devlink is
> > extended for sub device's life cycle.
> > However a devlink tool and its kernel counter part is not sufficient
> > to create protocol agnostic devices on a existing PCI bus.
> "Protocol agnostic"?... What does that mean?
Devlink works on a bus/device model; the class of the device doesn't matter.
For example, a PCI device can be of any class, so the newly created sub-devices are not limited to netdev/rdma devices.
It is agnostic to protocol.
More importantly, we don't want to create these sub-devices with bus type 'pci',
because, as described below, PCI has its own addressing scheme and the PCI bus must not have mixed-and-matched devices.

So a better wording would probably be:
'a devlink tool and its kernel counterpart are not sufficient to create sub-devices of the same class as the PCI device.'

> > (b) subdev bus
> > A given bus defines well defined addressing scheme. Creating sub
> > devices on existing PCI bus with a different naming scheme is just weird.
> > So, creating well named devices on appropriate bus is desired.
> What's that address scheme you're referring to, you seem to assign IDs in
> sequence?
Yes. A device on the subdev bus follows the standard Linux driver model id assignment scheme (a u32).
Devices are named accordingly, e.g. 'subdev0': prefix + id, the default scheme of the core driver model.

> >
> > Given that, these are user created devices for a given hardware and in
> > absence of a central entity like PCISIG to assign vendor and device
> > ids, A unique vendor and device id are maintained as enum in
> > include/linux/subdev_ids.h.
> Why do we need IDs? The sysfs hierarchy isn't sufficient?

> Do we need a driver to match on those again? Is it going to be a different driver?
IDs are used to match a driver against the created device.
It can be the same driver or a different one.
Even in the same-driver case, it provides a clear code separation between creating sub-devices and creating their respective protocol devices (netdev, rep-netdev, rdma, ...).

> > subdev bus device names follow default device naming scheme of Linux
> > kernel. It is done as 'subdev<instance_id>' such as, subdev0, subdev3.
> >
> > System example view:
> > --------------------
> >
> > $ devlink dev show
> > pci/0000:05:00.0
> >
> > $ devlink dev add pci/0000:05:00.0
> That does not look great.
Yes, it must return the bus + device attributes in the user output too.
The code in the existing patchset returns it; it is just not shown here.
I will fix the cover letter.

> Also you have to return the id of the spawned device, otherwise this is very
> racy.
Yes, that is correct. It must return a devlink device id, i.e. the {bus, device} attributes.
I will update the example in v2.
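For illustration, the v2 flow might look like the following; this is purely a sketch, and the exact output format is not final:

```
$ devlink dev add pci/0000:05:00.0
subdev/subdev0        <- id of the spawned device returned to the user

$ devlink dev show
pci/0000:05:00.0
subdev/subdev0
```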

> > $ devlink dev show
> > pci/0000:05:00.0
> > subdev/subdev0
> Please don't spawn devlink instances. Devlink instance is supposed to
> represent an ASIC. If we start spawning them willy nilly for whatever
> software construct we want to model the clarity of the ontology will suffer a
> lot.
Devlink devices are not restricted to ASICs, even though today they represent an ASIC for one vendor.
Today one ASIC already presents multiple devlink devices (128 or more) for the PF and VFs, two PFs on the same ASIC, etc.
A VF is just a sub-device that happens to be well defined by the PCISIG, whereas a sub-device is not.
Sub-devices do consume actual ASIC resources (just like PFs and VFs);
hence point (6) of the cover letter indicates that devlink should be able to tell how many such sub-devices can be created.

In the above example, they are created for a given bus/device following the existing devlink construct.

> Please see the discussion on my recent patchset. I think Jiri CCed you.
I will review the discussion shortly after this reply and provide comments.

> > Alternatives considered:
> > ------------------------
> > Will discuss separately if needed to keep this RFC short.
> Please do discuss.
(a) Subports instead of sub-devices.
We dropped this option because it is too restrictive; I explained the benefits of a devlink device above.

(b) Extending the iproute2 'ip link' and 'rdma' tools to create sub-devices.
That is too limiting and doesn't provide all the features we get with devlink.
It also doesn't address the passthrough needs, and it is just ugly to create and manage PCI-level devices using high-level tools like 'ip' and 'rdma'.

(c) Creating a platform device and platform driver instead of a subdev bus.
Our understanding is that a platform device for this purpose would be an abuse/misuse, but our view is limited and based on the kernel documentation in [1].
[1] says "platform devices typically appear as autonomous entities",
whereas sub-devices are created, configured, and managed by the user.
Most of the "Platform devices" section of [1] does not match subdev.

Greg suggested using the mfd framework (a wrapper around platform devices), which would also need extension:
mfd_remove_devices() removes all the devices, while here we want to add/remove individual devices based on user requests.
We will wait to see if he is OK with the subdev bus, or prefers to extend the platform documentation and mfd to remove individual devices.

(d) drivers/visorbus.
This bus is limited to a UUID/GUID-based naming scheme and is very specific to the s-Par standard and vendor.
Additionally, its guest drivers have been living in staging for more than a year.
So it doesn't appear to be the right direction.

(e) Creating subdev as a child object of the devlink device (such as port, registers, health, etc.).
In this mode, a given devlink device has a multiport child device which is anchored using 'struct device' and life-cycled through devlink.
The only difference from the current proposal is that it doesn't follow the standard driver model for binding to another driver.
It also doesn't show up in a unified way in 'devlink dev show'.

So instead of these alternatives, a devlink device that matches the PF, VF, and sub-device, plus the subdev bus, seems like the better design.
It follows all the standard constructs of (1) devlink and (2) the Linux driver model.
It is not limited to ports and is generic enough for networking and non-networking devices.

> The things key thing for me on the netdev side is what is the forwarding
> model to this new entity. Is this basically VMDQ?
> Should we just go ahead and mandate "switchdev mode" here?
It will follow switchdev mode, but it is not limited to it.
Switchdev mode is for the eswitch functionality; there isn't a need to couple the two.
RDMA/InfiniBand will be able to use this without switchdev mode.

> Thanks for working on a common architecture and suffering through
> people's reviews rather than adding a debugfs interface that does this like a
> different vendor did :)
Oh yes, let's not do debugfs.
Thanks a lot, Jakub, for the review.
This common architecture should be able to address such common needs.
Please let me know if this needs more refinement or if I missed something.