RE: [RFC net-next 0/8] Introducing subdev bus and devlink extension
From: Parav Pandit
Date: Thu Mar 07 2019 - 15:27:41 EST
> -----Original Message-----
> From: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> Sent: Thursday, March 7, 2019 1:04 PM
> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jakub Kicinski
> <jakub.kicinski@xxxxxxxxxxxxx>
> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxxxx>; Alex
> Williamson <alex.williamson@xxxxxxxxxx>
> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>
> CC += Alex
>
> On 3/6/2019 11:12 AM, Parav Pandit wrote:
> > Hi Kirti,
> >
> >> -----Original Message-----
> >> From: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> >> Sent: Tuesday, March 5, 2019 9:51 PM
> >> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jakub Kicinski
> >> <jakub.kicinski@xxxxxxxxxxxxx>
> >> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-
> >> kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
> davem@xxxxxxxxxxxxx;
> >> gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxxxx>
> >> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
> >> extension
> >>
> >>
> >>
> >> On 3/6/2019 6:14 AM, Parav Pandit wrote:
> >>> Hi Greg, Kirti,
> >>>
> >>>> -----Original Message-----
> >>>> From: Parav Pandit
> >>>> Sent: Tuesday, March 5, 2019 5:45 PM
> >>>> To: Parav Pandit <parav@xxxxxxxxxxxx>; Kirti Wankhede
> >>>> <kwankhede@xxxxxxxxxx>; Jakub Kicinski
> >> <jakub.kicinski@xxxxxxxxxxxxx>
> >>>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> >>>> linux- kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
> >> davem@xxxxxxxxxxxxx;
> >>>> gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxxxx>
> >>>> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink
> >>>> extension
> >>>>
> >>>>
> >>>>
> >>>>> -----Original Message-----
> >>>>> From: linux-kernel-owner@xxxxxxxxxxxxxxx <linux-kernel-
> >>>>> owner@xxxxxxxxxxxxxxx> On Behalf Of Parav Pandit
> >>>>> Sent: Tuesday, March 5, 2019 5:17 PM
> >>>>> To: Kirti Wankhede <kwankhede@xxxxxxxxxx>; Jakub Kicinski
> >>>>> <jakub.kicinski@xxxxxxxxxxxxx>
> >>>>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> >>>>> linux- kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
> >>>>> davem@xxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko
> >>>>> <jiri@xxxxxxxxxxxx>
> >>>>> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink
> >>>>> extension
> >>>>>
> >>>>> Hi Kirti,
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Kirti Wankhede <kwankhede@xxxxxxxxxx>
> >>>>>> Sent: Tuesday, March 5, 2019 4:40 PM
> >>>>>> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jakub Kicinski
> >>>>>> <jakub.kicinski@xxxxxxxxxxxxx>
> >>>>>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> >>>>>> linux- kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
> >>>>>> davem@xxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko
> >>>>>> <jiri@xxxxxxxxxxxx>
> >>>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and
> >>>>>> devlink extension
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> I am a novice at the mdev level too, mdev or vfio mdev.
> >>>>>>> Currently by default we bind to the same vendor driver, but when
> >>>>>>> it was created as a passthrough device, the vendor driver won't
> >>>>>>> create a netdevice or rdma device for it.
> >>>>>>> And vfio/mdev or whatever mature available driver would bind at
> >>>>>>> that point.
> >>>>>>>
> >>>>>>
> >>>>>> Using the mdev framework, if you want to partition a physical
> >>>>>> device into multiple logical devices, you can bind those devices to
> >>>>>> the same vendor driver through vfio-mdev, whereas if you want to
> >>>>>> pass the device through, bind it to vfio-pci. If I understand
> >>>>>> correctly, that is what you are looking for.
> >>>>>>
> >>>>>>
> >>>>> We cannot bind a whole PCI device to vfio-pci. The reason is that a
> >>>>> given PCI device has existing protocol devices on it, such as
> >>>>> netdevs and an rdma dev.
> >>>>> This device is partitioned while those protocol devices exist and
> >>>>> the mlx5_core and mlx5_ib drivers are loaded on it.
> >>>>> We also need to connect these objects correctly to the eswitch
> >>>>> exposed by the devlink interface (net/core/devlink.c), which
> >>>>> supports eswitch binding, health, registers, parameters, and ports.
> >>>>> It also supports existing PCI VFs.
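
As a concrete illustration of the devlink port support mentioned above: a
sub device could be exposed as a devlink port of the parent's devlink
instance, next to the existing PF/VF ports. A minimal sketch; struct
mlx5_sdev and its members are illustrative assumptions, not actual mlx5
code:

#include <net/devlink.h>

struct mlx5_sdev {                      /* hypothetical sub device state */
        struct devlink_port dl_port;
        struct net_device *netdev;
};

/* Register the sub device as a devlink port of the parent's devlink
 * instance, alongside the existing PF/VF ports.
 */
static int mlx5_sdev_port_init(struct devlink *devlink,
                               struct mlx5_sdev *sdev,
                               unsigned int port_index)
{
        int err;

        err = devlink_port_register(devlink, &sdev->dl_port, port_index);
        if (err)
                return err;
        /* Report the port as Ethernet once its netdev exists. */
        devlink_port_type_eth_set(&sdev->dl_port, sdev->netdev);
        return 0;
}
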
> >>>>>
> >>>>> I don't think we want to replicate all of this again in the mdev
> >>>>> subsystem [1].
> >>>>>
> >>>>> [1]
> >>>>> https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
> >>>>>
> >>>>> So a devlink interface to migrate users from managing VFs to non-VF
> >>>>> sub devices is a natural progression.
> >>>>>
> >>>>> However, in the future, I believe we will be creating mediated
> >>>>> devices on user request, to use the mdev modules and map them to a
> >>>>> VM.
> >>>>>
> >>>>> Also, 'mdev_bus' is created as a class and not as a bus. This
> >>>>> prevents use of the devlink interface, whose handle is bus+device
> >>>>> name.
> >>>>>
> >>>>> So one option is to change mdev from a class to a bus.
> >>>>> devlink would create mdevs on the bus, and the mdev driver could
> >>>>> probe these devices on the host system by default.
> >>>>> And if told to do passthrough, a different driver would expose them
> >>>>> to a VM.
> >>>>> How feasible is this?
> >>>>>
> >>>> Wait, I do see an mdev bus, and mdevs are created on this bus using
> >>>> mdev_device_create().
> >>>> So how about we create mdevs on this bus using devlink, instead of
> >>>> sysfs?
> >>>> And the driver side on the host gets mdev_register_driver()->probe()?
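
To illustrate the bus+device-name handle point above: a devlink instance
registered against the mdev's struct device would get an "mdev/<uuid>"
handle. A minimal sketch; mlx5_mdev_devlink_ops and struct mlx5_mdev_priv
are assumed names, not actual code:

#include <linux/mdev.h>
#include <net/devlink.h>

/* Register a devlink instance for a freshly created mdev. devlink
 * derives the handle from the device's bus name and device name, so
 * here it becomes "mdev/<uuid>".
 */
static struct devlink *mlx5_mdev_devlink_create(struct mdev_device *mdev)
{
        struct devlink *devlink;

        /* mlx5_mdev_devlink_ops and mlx5_mdev_priv are assumptions. */
        devlink = devlink_alloc(&mlx5_mdev_devlink_ops,
                                sizeof(struct mlx5_mdev_priv));
        if (!devlink)
                return NULL;
        if (devlink_register(devlink, mdev_dev(mdev))) {
                devlink_free(devlink);
                return NULL;
        }
        return devlink;
}
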
> >>>>
> >>>
> >>> Thinking more and reviewing more mdev code, I believe mdev fits this
> >>> need a lot better than a new subdev bus, mfd, platform device, or
> >>> devlink subport.
> >>> In the coming future, mapping this sub device (mdev) to a VM will also
> >>> be easier using the mdev bus.
> >>>
> >>
> >> Thanks for taking close look at mdev code.
> >>
> >> Support for assigning an mdev to a VM is already in place; QEMU and
> >> libvirt can assign an mdev device to a VM.
> >>
> >>> I also believe we can use the sysfs interface for the mdev life cycle
> >>> (instead of doing the life cycle via devlink).
> >>> When an mdev is created, it will register as a devlink instance and
> >>> will be able to query/config parameters before the driver probes the
> >>> device.
> >>>
> >>> A few enhancements would be needed on the mdev side:
> >>> 1. Making the IOMMU optional.
> >>
> >> Currently mdev devices are not IOMMU aware; the vendor driver is
> >> responsible for programming the IOMMU for an mdev device, if required.
> >> The IOMMU-aware mdev device patch set is almost fully reviewed and
> >> ready to get pulled. This is optional: the vendor driver decides
> >> whether the mdev device should be associated with its parent's IOMMU
> >> or not. I'm testing it; I think Alex is on vacation, and this will get
> >> pulled when he is back.
> >> https://lwn.net/Articles/779650/
> >>
> >>> 2. Configuring mdev device parameters at creation time.
> >>>
> >>
> >> The mdev framework provides a way to define multiple types for
> >> creation through sysfs. You can define multiple types rather than
> >> having a creation-time parameter, and on creation update
> >> 'available_instances' accordingly.
> >> Mdev also provides a way to expose vendor-specific attributes for the
> >> parent physical device as well as for a created mdev device. You can
> >> add a sysfs interface to take input parameters for an mdev device,
> >> which the vendor driver can use when open() is called on that mdev
> >> device.
> >>
> >> Thanks,
> >> Kirti
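
For reference, the per-type sysfs attributes described above look roughly
like this in a vendor driver. A minimal sketch; the type name and the
mlx5_mdev_available_instances() helper are assumptions:

#include <linux/mdev.h>

/* Report how many more instances of this type can be created; a real
 * driver also provides the mandatory 'device_api' attribute, etc.
 * mlx5_mdev_available_instances() is an assumed helper.
 */
static ssize_t available_instances_show(struct kobject *kobj,
                                        struct device *dev, char *buf)
{
        return sprintf(buf, "%u\n", mlx5_mdev_available_instances(dev));
}
static MDEV_TYPE_ATTR_RO(available_instances);

static struct attribute *mlx5_mdev_type_attrs[] = {
        &mdev_type_attr_available_instances.attr,
        NULL,
};

static struct attribute_group mlx5_mdev_type_group = {
        .name  = "mlx5_sub_dev",        /* type name shown under sysfs */
        .attrs = mlx5_mdev_type_attrs,
};

static struct attribute_group *mlx5_mdev_type_groups[] = {
        &mlx5_mdev_type_group,
        NULL,
};
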
> >
> > Yes. I have adapted my patches to the mdev way and will be posting RFC
> > v2 soon.
> > I will wait a day to receive more comments/views from Greg and others.
> >
> > As I explained in this cover letter and discussion, the first use case
> > is to create and use mdevs in the host (and not in a VM).
> > Later on, once we have mdevs available, I am sure VM users will likely
> > use them.
> >
> > So, the mlx5_core driver will have two components as a starting point.
> >
> > 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> > This is the mdev device life-cycle driver, which will call
> > mdev_register_device() and implement mlx5_mdev_ops.
> >
> OK. I would suggest not using the mdev.c file name; maybe add the device
> name, something like mlx_mdev.c or vfio_mlx.c.
>
The mlx5/core coding convention is not to prefix "mlx" to its 40+ files;
it uses the actual subsystem or functionality name, such as:
sriov.c
eswitch.c
fw.c
en_tc.c (en for Ethernet)
lag.c
So mdev.c aligns with the rest of the 40+ files.
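
For what it is worth, the registration in such an mdev.c would look
roughly like the sketch below; the mlx5_mdev_* callbacks and the type
groups are assumed names, not actual mlx5 code:

#include <linux/module.h>
#include <linux/pci.h>
#include <linux/mdev.h>

/* Life-cycle side: create()/remove() carve out and return the sub
 * device's HW resources (queues, etc.) on the parent device.
 */
static int mlx5_mdev_create(struct kobject *kobj, struct mdev_device *mdev)
{
        /* Allocate HW resources for this sub device. */
        return 0;
}

static int mlx5_mdev_remove(struct mdev_device *mdev)
{
        /* Return the sub device's HW resources to the parent. */
        return 0;
}

static const struct mdev_parent_ops mlx5_mdev_ops = {
        .owner                 = THIS_MODULE,
        .supported_type_groups = mlx5_mdev_type_groups, /* as sketched above */
        .create                = mlx5_mdev_create,
        .remove                = mlx5_mdev_remove,
};

/* Called from the mlx5_core probe path with the parent PCI device. */
int mlx5_mdev_init(struct pci_dev *pdev)
{
        return mdev_register_device(&pdev->dev, &mlx5_mdev_ops);
}
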
> > 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> > This is the mdev device driver, which does mdev_register_driver(); its
> > probe() creates a netdev by heavily reusing the existing code of the PF
> > device.
> > These drivers will not be placed under drivers/vfio/mdev, because this
> > is not a vfio driver.
> > This is fine, right?
> >
>
> I'm not too familiar with netdev, but can you create the netdev on the
> open() call on the mlx mdev device? Then you don't have to write an mdev
> device driver.
>
Who invokes open() and release()?
I believe it is QEMU that would do open(), release(), read/write/mmap?
Assuming that is the case, I think it is incorrect to create the netdev in
open(), because when we want to map the mdev to a VM using the above mdev
calls, we actually won't be creating a netdev in the host.
Instead, some queues etc. will be set up as part of these calls.
By default the created mdev is bound to vfio_mdev, and once we unbind the
device from that driver, we need to bind it to the mlx5 driver so that the
driver can create the netdev etc.
Or did I get open() and friends wrong?
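
To make the intent concrete, the host-side driver in mdev_driver.c would
be roughly the sketch below; the netdev setup/teardown helpers are assumed
names, not actual mlx5 code:

#include <linux/module.h>
#include <linux/mdev.h>

/* probe() runs when an mdev is bound to this driver (instead of
 * vfio_mdev) and creates the netdev on the host; open()/release() are
 * only involved in the vfio passthrough path.
 * mlx5_mdev_setup_netdev()/mlx5_mdev_teardown_netdev() are assumptions.
 */
static int mlx5_mdev_driver_probe(struct device *dev)
{
        return mlx5_mdev_setup_netdev(mdev_from_dev(dev));
}

static void mlx5_mdev_driver_remove(struct device *dev)
{
        mlx5_mdev_teardown_netdev(mdev_from_dev(dev));
}

static struct mdev_driver mlx5_mdev_driver = {
        .name   = "mlx5_mdev",
        .probe  = mlx5_mdev_driver_probe,
        .remove = mlx5_mdev_driver_remove,
};

static int __init mlx5_mdev_driver_init(void)
{
        return mdev_register_driver(&mlx5_mdev_driver, THIS_MODULE);
}
module_init(mlx5_mdev_driver_init);
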
>
> > Given that this is a net driver, we will be submitting patches through
> > the netdev mailing list via Dave Miller's net-next tree,
> > and CC kvm@xxxxxxxxxxxxxxx, you, and others as usual.
> > Are you OK with merging the code this way, as the mdev device creator
> > and mdev driver? Yes?
> >
>
> Keep Alex and me in the loop.
Sure. Thanks.