RE: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

From: Tian, Kevin
Date: Wed Feb 10 2021 - 02:54:21 EST


Hi, Max,

> From: Max Gurtovoy <mgurtovoy@xxxxxxxxxx>
> Sent: Tuesday, February 2, 2021 12:28 AM
>
> Hi Alex and Cornelia,
>
> This series splits the vfio_pci driver into 2 parts: a pci driver and a
> subsystem driver that will also be a library of code. The pci driver,
> vfio_pci.ko, will be used as before and will bind to the subsystem
> driver vfio_pci_core.ko to register to the VFIO subsystem. This patchset
> is fully backward compatible. This is typical Linux subsystem framework
> behaviour. This framework can also be adopted by vfio_mdev devices, as
> we'll see in the sketch below.
>
> This series aims to solve the issues that were raised in the previous
> attempt to extend vfio-pci for vendor-specific functionality by Yan
> Zhao: https://lkml.org/lkml/2020/5/17/376.
>
> This solution is also deterministic in the sense that when a user binds
> to a vendor-specific vfio-pci driver, it will get all the special
> goodies of the HW.
>
> This subsystem framework will also ease adding vendor-specific
> functionality to VFIO devices in the future by allowing another module
> to provide the pci_driver, which can set up a number of details before
> registering to the VFIO subsystem (such as injecting its own
> operations).

I'm a bit confused about the change from v1 to v2, especially about
how to inject module-specific operations. From a live migration point
of view it may require at least two hook points for some devices (e.g.
i40e in Yan's original example): registering a migration region and
intercepting guest writes to specific registers. [PATCH 4/9]
demonstrates the former but not the latter (which was allowed in v1).
I saw some concerns in v1 about exposing struct vfio_pci_device outside
of vfio-pci-core, which is presumably the main reason driving this
change. But I'm still curious how we plan to address this requirement
(allowing the vendor driver to tweak specific vfio_device_ops) moving
forward.
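
For example, to intercept guest writes I was expecting the vendor driver
to be able to wrap the core implementation, roughly like the fragment
below (just a sketch; vfio_pci_core_write() and the mlx5_* helpers are
made-up names for illustration, not APIs from this series):

static ssize_t mlx5_vfio_pci_write(void *device_data, const char __user *buf,
                                   size_t count, loff_t *ppos)
{
        /*
         * Trap guest writes to the registers that matter for migration /
         * dirty tracking, then fall back to the generic implementation.
         */
        if (mlx5_vfio_pci_reg_is_trapped(*ppos))
                return mlx5_vfio_pci_trap_write(device_data, buf, count, ppos);

        return vfio_pci_core_write(device_data, buf, count, ppos);
}

static const struct vfio_device_ops mlx5_vfio_pci_ops = {
        .name   = "mlx5-vfio-pci",
        /* other callbacks simply delegate to vfio-pci-core */
        .write  = mlx5_vfio_pci_write,
};

[PATCH 4/9] shows how to add a migration region, but not how a vendor
driver would override .read/.write like this in v2.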

Then another question. Once we have this framework in place, do we
mandate this approach for any vendor-specific tweak, or do we still
allow doing it as vfio_pci_core extensions (such as igd and zdev in this
series)? If the latter, what are the criteria for judging which way is
desired? Also, what about the scenarios where we just want one-time
vendor information, e.g. to tell whether a device can tolerate arbitrary
I/O page faults [1], or the offset in VF PCI config space at which to
put the PASID/ATS/PRI capabilities [2]? Do we expect to create a module
for each device to provide such info? Having those questions answered
would be helpful for a better understanding of this proposal IMO. 😊

[1] https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fdd00@xxxxxxxxxx/T/
[2] https://lore.kernel.org/kvm/20200407095801.648b1371@xxxxxxxxx/

>
> Below we can see the proposed changes (this patchset only deals with the
> VFIO_PCI subsystem, but it can easily be extended to the VFIO_MDEV
> subsystem as well):
>
> +------------------------------------------------------------------+
> |                                                                  |
> |                               VFIO                               |
> |                                                                  |
> +------------------------------------------------------------------+
>
> +------------------------------+    +------------------------------+
> |                              |    |                              |
> |        VFIO_PCI_CORE         |    |        VFIO_MDEV_CORE        |
> |                              |    |                              |
> +------------------------------+    +------------------------------+
>
> +--------------+ +-------------+    +-------------+ +--------------+
> |              | |             |    |             | |              |
> |              | |             |    |             | |              |
> |   VFIO_PCI   | |MLX5_VFIO_PCI|    |  VFIO_MDEV  | |MLX5_VFIO_MDEV|
> |              | |             |    |             | |              |
> |              | |             |    |             | |              |
> +--------------+ +-------------+    +-------------+ +--------------+
>

The VFIO_PCI part is clear, but I didn't get how this concept is applied
to VFIO_MDEV. The mdev sub-system looks like below in my mind:

+------------------------------------------------------------------+
|                                                                  |
|                               VFIO                               |
|                                                                  |
+------------------------------------------------------------------+

+------------------------------+    +------------------------------+
|                              |    |                              |
|        VFIO_PCI_CORE         |    |          VFIO_MDEV           |
|                              |    |                              |
+------------------------------+    +------------------------------+

+--------------+ +-------------+    +------------------------------+
|              | |             |    |                              |
|              | |             |    |                              |
|   VFIO_PCI   | |MLX5_VFIO_PCI|    |          MDEV CORE           |
|              | |             |    |                              |
|              | |             |    |                              |
+--------------+ +-------------+    +------------------------------+

                                    +------------------------------+
                                    |                              |
                                    |                              |
                                    |          MLX5-MDEV           |
                                    |                              |
                                    |                              |
                                    +------------------------------+

MDEV core is already a well-defined subsystem connecting the mdev bus
driver (vfio-mdev) and the mdev device drivers (e.g. mlx5-mdev).
vfio-mdev is just the channel that brings the VFIO APIs through mdev
core to the underlying vendor-specific mdev device driver, which is
already granted the flexibility to tweak whatever it needs through
mdev_parent_ops (see the sketch after this paragraph). Then what exact
extension is being proposed here by creating another subsystem module?
Or are we talking about some general library which can be shared by the
underlying mdev device drivers to reduce duplicated emulation code?
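
To illustrate what I mean by that flexibility, an mdev device driver
today can already do roughly the following (the mlx5_mdev_* handler
names and type groups below are made up, just for illustration):

static const struct mdev_parent_ops mlx5_mdev_ops = {
        .owner                 = THIS_MODULE,
        .supported_type_groups = mlx5_mdev_type_groups,
        .create                = mlx5_mdev_create,
        .remove                = mlx5_mdev_remove,
        .open                  = mlx5_mdev_open,
        .release               = mlx5_mdev_release,
        .read                  = mlx5_mdev_read,   /* full control of config/BAR emulation */
        .write                 = mlx5_mdev_write,
        .ioctl                 = mlx5_mdev_ioctl,  /* incl. region/irq setup */
        .mmap                  = mlx5_mdev_mmap,
};

/* in the parent device probe path */
ret = mdev_register_device(&pdev->dev, &mlx5_mdev_ops);

i.e. the hooks above already cover device-specific emulation end to end.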

> First 3 patches introduce the above changes for vfio_pci and
> vfio_pci_core.
>
> Patch (4/9) introduces a new mlx5 vfio-pci module that registers to the
> VFIO subsystem using vfio_pci_core. It also registers to the Auxiliary
> bus to bind to mlx5_core, which is the parent of the mlx5-vfio-pci
> devices. This will allow extending mlx5-vfio-pci devices with
> HW-specific features such as Live Migration (the mlx5_core patches are
> not part of this series, which only proposes the changes needed for the
> vfio pci subsystem).
>
> For our testing and development we used 2 VirtIO-BLK physical functions
> based on NVIDIA's Bluefield-2 SNAP technology. These 2 PCI functions
> were enumerated as 08:00.0 and 09:00.0 by the hypervisor.
>
> The Bluefield-2 device driver, mlx5_core, will create an auxiliary
> device for each VirtIO-BLK SNAP PF (the "parent"/"manager" of each
> VirtIO-BLK PF will actually issue the auxiliary device creation).
>
> These auxiliary devices will be seen on the Auxiliary bus as:
> mlx5_core.vfio_pci.2048 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.0/mlx5_core.vfio_pci.2048
> mlx5_core.vfio_pci.2304 -> ../../../devices/pci0000:00/0000:00:02.0/0000:05:00.0/0000:06:00.0/0000:07:00.1/mlx5_core.vfio_pci.2304
>
> 2048 is BDF 08:00.0 in decimal (its parent is 0000:07:00.0, Bluefield-2
> p0) and 2304 is BDF 09:00.0 in decimal (its parent is 0000:07:00.1,
> Bluefield-2 p1). In this manner, the administrator will be able to
> locate the correct vfio-pci module to bind the desired BDF to (by
> finding the pointer to the module according to the Auxiliary driver of
> that BDF).
>
> Note: The discovery mechanism we used for the RFC might need some
> improvements as mentioned in the TODO list.
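
Btw, just to confirm my understanding of the binding model: on the
driver side this would presumably be a plain auxiliary driver match,
roughly like the fragment below (hypothetical, based on the generic
auxiliary bus API rather than on the code in this series):

static const struct auxiliary_device_id mlx5_vfio_pci_id_table[] = {
        { .name = "mlx5_core.vfio_pci" },
        {},
};
MODULE_DEVICE_TABLE(auxiliary, mlx5_vfio_pci_id_table);

static struct auxiliary_driver mlx5_vfio_pci_driver = {
        .name     = "vfio_pci",
        .probe    = mlx5_vfio_pci_probe,   /* made-up handler names */
        .remove   = mlx5_vfio_pci_remove,
        .id_table = mlx5_vfio_pci_id_table,
};
module_auxiliary_driver(mlx5_vfio_pci_driver);

so the admin maps mlx5_core.vfio_pci.2048 back to this module through
the auxiliary driver that matches it, right?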
>
> In this way, we'll use the HW vendor driver core to manage the lifecycle
> of these devices. This is reasonable since only the vendor driver knows
> exactly the status of its internal state and the capabilities of its
> accelerators, for example.
>
> Changes from v1:
> - Create private and public vfio-pci structures (from Alex)
> - Register to the vfio core directly from vfio-pci-core (for now, expose
> minimal public functionality to vfio pci drivers; this removes the
> implicit behaviour from v1, and more power can be given to the drivers
> in the future)
> - Add patch 3/9 to emphasize the extension needed for the LM feature
> (from Cornelia)
> - Take/release a refcount on the pci module during core open/release
> - Rename the nvlink, igd and zdev code to PowerNV, x86 and s390
> extensions for vfio-pci core
> - Fix segfault bugs in the current vfio-pci zdev code
>
> TODOs:
> 1. Create a subsystem module for VFIO_MDEV. This can be used for
> vendor-specific scalable functions (SFs), for example.
> 2. Add live migration functionality for mlx5 SNAP devices
> (NVMe/Virtio-BLK).
> 3. Add live migration functionality for mlx5 VFs.
> 4. Add the needed functionality to mlx5_core.
> 5. Improve the userspace interface and the discovery mechanism.
> 6. Move the VGA code to the x86 extension.
>
> I would like to thank the great team that was involved in this
> development, design and internal review:
> Oren, Liran, Jason, Leon, Aviad, Shahaf, Gary, Artem, Kirti, Neo, Andy
> and others.
>
> This series applies cleanly on top of kernel 5.11-rc5+ commit 13391c60da33:
> "Merge branch 'linus' of
> git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6"
> from Linus.
>
> Note: Live migration for MLX5 SNAP devices is WIP and can be the first
> example of adding a vendor extension to vfio-pci devices. As the changes
> to the subsystem must be defined as a precondition for this work, we've
> decided to split the submission for now.
>
> Max Gurtovoy (9):
> vfio-pci: rename vfio_pci.c to vfio_pci_core.c
> vfio-pci: introduce vfio_pci_core subsystem driver
> vfio-pci-core: export vfio_pci_register_dev_region function
> mlx5-vfio-pci: add new vfio_pci driver for mlx5 devices
> vfio-pci/zdev: remove unused vdev argument
> vfio-pci/zdev: fix possible segmentation fault issue
> vfio/pci: use s390 naming instead of zdev
> vfio/pci: use x86 naming instead of igd
> vfio/pci: use powernv naming instead of nvlink2
>
> drivers/vfio/pci/Kconfig | 57 +-
> drivers/vfio/pci/Makefile | 16 +-
> drivers/vfio/pci/mlx5_vfio_pci.c | 253 ++
> drivers/vfio/pci/vfio_pci.c | 2384 +----------------
> drivers/vfio/pci/vfio_pci_config.c | 56 +-
> drivers/vfio/pci/vfio_pci_core.c | 2326 ++++++++++++++++
> drivers/vfio/pci/vfio_pci_core.h | 73 +
> drivers/vfio/pci/vfio_pci_intrs.c | 22 +-
> ...{vfio_pci_nvlink2.c => vfio_pci_powernv.c} | 47 +-
> drivers/vfio/pci/vfio_pci_private.h | 44 +-
> drivers/vfio/pci/vfio_pci_rdwr.c | 14 +-
> .../pci/{vfio_pci_zdev.c => vfio_pci_s390.c} | 28 +-
> .../pci/{vfio_pci_igd.c => vfio_pci_x86.c} | 18 +-
> include/linux/mlx5/vfio_pci.h | 36 +
> 14 files changed, 2916 insertions(+), 2458 deletions(-)
> create mode 100644 drivers/vfio/pci/mlx5_vfio_pci.c
> create mode 100644 drivers/vfio/pci/vfio_pci_core.c
> create mode 100644 drivers/vfio/pci/vfio_pci_core.h
> rename drivers/vfio/pci/{vfio_pci_nvlink2.c => vfio_pci_powernv.c} (89%)
> rename drivers/vfio/pci/{vfio_pci_zdev.c => vfio_pci_s390.c} (80%)
> rename drivers/vfio/pci/{vfio_pci_igd.c => vfio_pci_x86.c} (89%)
> create mode 100644 include/linux/mlx5/vfio_pci.h
>
> --
> 2.25.4

Thanks
Kevin