Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension

From: Kirti Wankhede
Date: Thu Mar 07 2019 - 14:04:43 EST


CC += Alex

On 3/6/2019 11:12 AM, Parav Pandit wrote:
> Hi Kirti,
>
>> -----Original Message-----
>> From: Kirti Wankhede <kwankhede@xxxxxxxxxx>
>> Sent: Tuesday, March 5, 2019 9:51 PM
>> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jakub Kicinski
>> <jakub.kicinski@xxxxxxxxxxxxx>
>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-
>> kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx; davem@xxxxxxxxxxxxx;
>> gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxxxx>
>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink extension
>>
>>
>>
>> On 3/6/2019 6:14 AM, Parav Pandit wrote:
>>> Hi Greg, Kirti,
>>>
>>>> -----Original Message-----
>>>> From: Parav Pandit
>>>> Sent: Tuesday, March 5, 2019 5:45 PM
>>>> To: Parav Pandit <parav@xxxxxxxxxxxx>; Kirti Wankhede
>>>> <kwankhede@xxxxxxxxxx>; Jakub Kicinski
>> <jakub.kicinski@xxxxxxxxxxxxx>
>>>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; linux-
>>>> kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
>> davem@xxxxxxxxxxxxx;
>>>> gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko <jiri@xxxxxxxxxxxx>
>>>> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>> extension
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: linux-kernel-owner@xxxxxxxxxxxxxxx <linux-kernel-
>>>>> owner@xxxxxxxxxxxxxxx> On Behalf Of Parav Pandit
>>>>> Sent: Tuesday, March 5, 2019 5:17 PM
>>>>> To: Kirti Wankhede <kwankhede@xxxxxxxxxx>; Jakub Kicinski
>>>>> <jakub.kicinski@xxxxxxxxxxxxx>
>>>>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
>>>>> linux- kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
>>>>> davem@xxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko
>>>>> <jiri@xxxxxxxxxxxx>
>>>>> Subject: RE: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>>> extension
>>>>>
>>>>> Hi Kirti,
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Kirti Wankhede <kwankhede@xxxxxxxxxx>
>>>>>> Sent: Tuesday, March 5, 2019 4:40 PM
>>>>>> To: Parav Pandit <parav@xxxxxxxxxxxx>; Jakub Kicinski
>>>>>> <jakub.kicinski@xxxxxxxxxxxxx>
>>>>>> Cc: Or Gerlitz <gerlitz.or@xxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
>>>>>> linux- kernel@xxxxxxxxxxxxxxx; michal.lkml@xxxxxxxxxxx;
>>>>>> davem@xxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; Jiri Pirko
>>>>>> <jiri@xxxxxxxxxxxx>
>>>>>> Subject: Re: [RFC net-next 0/8] Introducing subdev bus and devlink
>>>>>> extension
>>>>>>
>>>>>>
>>>>>>
>>>>>>> I am novice at mdev level too. mdev or vfio mdev.
>>>>>>> Currently by default we bind to same vendor driver, but when it
>>>>>>> was
>>>>>> created as passthrough device, vendor driver won't create netdevice
>>>>>> or rdma device for it.
>>>>>>> And vfio/mdev or whatever mature available driver would bind at
>>>>>>> that
>>>>>> point.
>>>>>>>
>>>>>>
>>>>>> Using mdev framework, if you want to partition a physical device
>>>>>> into multiple logic devices, you can bind those devices to same
>>>>>> vendor driver through vfio-mdev, where as if you want to
>>>>>> passthrough the device bind it to vfio-pci. If I understand
>>>>>> correctly, that is what you are
>>>>> looking for.
>>>>>>
>>>>>>
>>>>> We cannot bind a whole PCI device to vfio-pci, reason is, A given
>>>>> PCI device has existing protocol devices on it such as netdevs and rdma
>> dev.
>>>>> This device is partitioned while those protocol devices exist and
>>>>> mlx5_core, mlx5_ib drivers are loaded on it.
>>>>> And we also need to connect these objects rightly to eswitch exposed
>>>>> by devlink interface (net/core/devlink.c) that supports eswitch
>>>>> binding, health, registers, parameters, ports support.
>>>>> It also supports existing PCI VFs.
>>>>>
>>>>> I donât think we want to replicate all of this again in mdev subsystem [1].
>>>>>
>>>>> [1]
>>>>> https://www.kernel.org/doc/Documentation/vfio-mediated-device.txt
>>>>>
>>>>> So devlink interface to migrate users from managing VFs to non_VF
>>>>> sub device is natural progression.
>>>>>
>>>>> However, in future, I believe we would be creating mediated devices
>>>>> on user request, to use mdev modules and map them to VM.
>>>>>
>>>>> Also 'mdev_bus' is created as a class and not as a bus. This limits
>>>>> to not use devlink interface whose handle is bus+device name.
>>>>>
>>>>> So one option is to change mdev from class to bus.
>>>>> devlink will create mdevs on the bus, mdev driver can probe these
>>>>> devices on host system by default.
>>>>> And if told to do passthrough, a different driver exposes them to VM.
>>>>> How feasible is this?
>>>>>
>>>> Wait, I do see a mdev bus and mdevs are created on this bus using
>>>> mdev_device_create().
>>>> So how about we create mdevs on this bus using devlink, instead of sysfs?
>>>> And driver side on host gets the mdev_register_driver()->probe()?
>>>>
>>>
>>> Thinking more and reviewing more mdev code, I believe mdev fits this
>>> need a lot better than new subdev bus, mfd, platform device, or devlink
>> subport.
>>> For coming future, to map this sub device (mdev) to VM will also be easier
>> by using mdev bus.
>>>
>>
>> Thanks for taking close look at mdev code.
>>
>> Assigning mdev to VM support is already in place, QEMU and libvirt have
>> support to assign mdev device to VM.
>>
>>> I also believe we can use the sysfs interface for mdev life cycle.
>>> Here when mdev are created it will register as devlink instance and
>>> will be able to query/config parameters before driver probe the device.
>>> (instead of having life cycle via devlink)
>>>
>>> Few enhancements would be needed for mdev side.
>>> 1. making iommu optional.
>>
>> Currently mdev devices are not IOMMU aware, vendor driver is responsible
>> for programming IOMMU for mdev device, if required.
>> IOMMU aware mdev device patch set is almost reviewed and ready to get
>> pulled. This is optional, vendor driver have to decide whether mdev device
>> should be associated with its parents IOMMU or not. I'm testing it and I
>> think Alex is on vacation and this will get pulled when Alex will be back from
>> vacation.
>> https://lwn.net/Articles/779650/
>>
>>> 2. configuring mdev device parameters during creation time
>>>
>>
>> Mdev framework provides a way to define multiple types for creation
>> through sysfs. You can define multiple types rather than having creation
>> time parameter and on creation accordingly update 'available_instances'.
>> Mdev also provides a way to provide vendor-specific-attributes for parent
>> physical device as well as for created mdev device. You can add sysfs
>> interface to get input parameters for a mdev device which can be used by
>> vendor driver when open() on that mdev device is called.
>>
>> Thanks,
>> Kirti
>
> Yes. I got my patches to adapt to mdev way. Will be posting RFC v2 soon.
> Will wait for a day to receive more comments/views from Greg and others.
>
> As I explained in this cover-letter and discussion,
> First use case is to create and use mdevs in the host (and not in VM).
> Later on, I am sure once we have mdevs available, VM users will likely use it.
>
> So, mlx5_core driver will have two components as starting point.
>
> 1. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev.c
> This is mdev device life cycle driver which will do, mdev_register_device() and implements mlx5_mdev_ops.
>
Ok. I would suggest not use mdev.c file name, may be add device name,
something like mlx_mdev.c or vfio_mlx.c

> 2. drivers/net/ethernet/mellanox/mlx5/core/mdev/mdev_driver.c
> This is mdev device driver which does mdev_register_driver()
> and probe() creates netdev by heavily reusing existing code of the PF device.
> These drivers will not be placed under drivers/vfio/mdev, because this is not a vfio driver.
> This is fine, right?
>

I'm not too familiar with netdev, but can you create netdev on open()
call on mlx mdev device? Then you don't have to write mdev device driver.


> Given that this is net driver, we will be submitting patches,
> through netdev mailing list through Dave Miller's net-next tree.
> And CC kvm@xxxxxxxxxxxxxxx, you and others as usual.
> Are you ok, merging code this way as mdev device creator and mdev driver.
> Yes?
>

Keep Alex and me in loop.

Thanks,
Kirti