Re: [PATCH v2 0/9] Introduce vfio-pci-core subsystem

From: Alex Williamson
Date: Wed Feb 10 2021 - 11:43:28 EST


On Wed, 10 Feb 2021 09:34:52 -0400
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Wed, Feb 10, 2021 at 07:52:08AM +0000, Tian, Kevin wrote:
> > > This subsystem framework will also ease adding vendor-specific
> > > functionality to VFIO devices in the future by allowing another module
> > > to provide the pci_driver, which can set up a number of details before
> > > registering with the VFIO subsystem (such as injecting its own operations).
> >
> > I'm a bit confused about the change from v1 to v2, especially about
> > how to inject module-specific operations. From a live migration p.o.v.
> > it may require at least two hook points for some devices (e.g. i40e
> > in Yan's original example):
>
> IMHO, it was too soon to give up on putting the vfio_device_ops in the
> final driver - we should try to define a reasonable public/private
> split of vfio_pci_device as is the norm in the kernel. No reason we
> can't achieve that.
>
> > register a migration region and intercept guest writes to specific
> > registers. [PATCH 4/9] demonstrates the former but not the latter
> > (which was allowed in v1).
>
> And this is why the ROI of wrapping every vfio op in a PCI op just to
> keep vfio_pci_device completely private is poor :(

Says someone who doesn't need to maintain the core, fixing bugs and
adding features, while not breaking vendor drivers that touch private
data in unexpected ways ;)
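
For concreteness, the kind of public/private split being suggested
would presumably look something like the sketch below; every name in it
is illustrative rather than taken from this series:

#include <linux/pci.h>
#include <linux/mutex.h>
#include <linux/vfio.h>

/*
 * "include/linux/vfio_pci_core.h" -- the part a vendor driver may see
 * and embed; the core commits to keeping this reasonably stable.
 */
struct vfio_pci_core_device {
        struct pci_dev                  *pdev;
        const struct vfio_device_ops    *vendor_ops;  /* from the vendor module */
        /* ...only the fields the core is willing to export... */
};

/*
 * "drivers/vfio/pci/vfio_pci_priv.h" -- private to the core; vendor
 * drivers never include this, so the core stays free to change it
 * without breaking anyone.
 */
struct vfio_pci_device {
        struct vfio_pci_core_device     core;      /* public part, embedded first */
        u8                              *vconfig;  /* config space emulation state */
        struct mutex                    igate;     /* interrupt bookkeeping */
        /* ...everything else stays private... */
};

/* The core converts between the two views with container_of(). */
static inline struct vfio_pci_device *
to_vfio_pci(struct vfio_pci_core_device *core)
{
        return container_of(core, struct vfio_pci_device, core);
}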

> > Then another question. Once we have this framework in place, do we
> > mandate this approach for any vendor-specific tweak, or do we still
> > allow doing it as vfio_pci_core extensions (such as igd and zdev in
> > this series)?
>
> I would say no to any further vfio_pci_core extensions that are tied
> to specific PCI devices. Things like zdev are platform features; they
> are not tied to specific PCI devices.
>
> > If the latter, what are the criteria for judging which way is desired? Also
> > what about the scenarios where we just want one-time vendor information,
> > e.g. to tell whether a device can tolerate arbitrary I/O page faults [1], or
> > the offset in VF PCI config space at which to put the PASID/ATS/PRI
> > capabilities [2]? Do we expect to create a module for each device to
> > provide such info? Having those questions answered would be helpful for
> > a better understanding of this proposal IMO. 😊
> >
> > [1] https://lore.kernel.org/kvm/d4c51504-24ed-2592-37b4-f390b97fdd00@xxxxxxxxxx/T/
>
> SVA is a platform feature, so no problem. Don't see a vfio-pci change
> in here?
>
> > [2] https://lore.kernel.org/kvm/20200407095801.648b1371@xxxxxxxxx/
>
> This one could have been done as a broadcom_vfio_pci driver. I'm not
> sure exposing the entire config space unprotected is safe; it's hard to
> know what the device has put in there, and whether it is secure to
> share with a guest...

The same argument could be made about the whole device, not just config
space. In fact we know that devices like GPUs provide access to config
space through other address spaces, I/O port and MMIO BARs. Config
space is only the architected means to manipulate the link interface
and device features; we can't determine whether it's the only means.
Any emulation or access restrictions we put in config space are meant
to make the device work better for the user (ex. re-using host kernel
device quirks) or to prevent casual misconfiguration (ex. incompatible
MPS settings, BAR registers). The safety of assigning a device to a
user is only as good as the isolation and error handling that the
platform allows.


> > MDEV core is already a well-defined subsystem connecting the mdev
> > bus driver (vfio-mdev) and the mdev device driver (mlx5-mdev).
>
> mdev is two things:
>
> - a driver core bus layer and sysfs that provide a lifetime model
> - a vfio bus driver that doesn't do anything but forward ops to the
>   main ops
>
> > vfio-mdev is just the channel to bring VFIO APIs through mdev core
> > to the underlying vendor-specific mdev device driver, which is already
> > granted the flexibility to tweak whatever it needs through mdev_parent_ops.
>
> This is the second thing, and it could just be deleted. The actual
> final mdev driver can just use vfio_device_ops directly. The
> redirection shim in vfio_mdev.c doesn't add value.
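
(For anyone following along: the shim in question is essentially a 1:1
trampoline. The fragment below is reconstructed from memory, so treat
the exact names as approximate rather than authoritative.)

/*
 * Roughly the shape of vfio_mdev.c today: each vfio_device_ops
 * callback just trampolines into the matching mdev_parent_ops
 * callback and adds no logic of its own.
 */
static ssize_t vfio_mdev_read(void *device_data, char __user *buf,
                              size_t count, loff_t *ppos)
{
        struct mdev_device *mdev = device_data;
        struct mdev_parent *parent = mdev->parent;

        if (unlikely(!parent->ops->read))
                return -EINVAL;

        return parent->ops->read(mdev, buf, count, ppos);
}

/*
 * ...and the same pattern repeats for open, release, write, ioctl and
 * mmap, which is why the final mdev driver could provide its
 * vfio_device_ops directly and this layer could go away.
 */
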
>
> > Then what exact extension is being proposed here by creating another
> > subsystem module? Or are we talking about some general library which
> > can be shared by underlying mdev device drivers to reduce duplicated
> > emulation code?
>
> IMHO it is more a design philosophy that the end driver should
> implement the vfio_device_ops directly vs having a stack of ops
> structs.
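
As a concrete sketch of that philosophy (and of the code-reuse point
below), a vendor driver built on a library-ized vfio-pci would own the
vfio_device_ops itself and delegate to core helpers wherever it has
nothing to add. Every name here is made up for illustration; the
vfio_pci_core_* helpers and acme_* hooks are assumptions, not anything
taken from this series:

#include <linux/vfio.h>

/*
 * Hypothetical "acme" vendor driver. Assumed-to-exist helpers
 * (illustrative only): vfio_pci_core_{open,release,read,write,ioctl,
 * mmap}(), acme_is_migration_region(), acme_handle_migration_write().
 */
static ssize_t acme_vfio_pci_write(void *device_data,
                                   const char __user *buf,
                                   size_t count, loff_t *ppos)
{
        /* Intercept only the writes the vendor migration flow cares about... */
        if (acme_is_migration_region(device_data, ppos))
                return acme_handle_migration_write(device_data, buf,
                                                   count, ppos);

        /* ...and hand everything else straight back to the shared core. */
        return vfio_pci_core_write(device_data, buf, count, ppos);
}

static const struct vfio_device_ops acme_vfio_pci_ops = {
        .name    = "acme-vfio-pci",
        .open    = vfio_pci_core_open,     /* core behaviour re-used as-is */
        .release = vfio_pci_core_release,
        .read    = vfio_pci_core_read,
        .write   = acme_vfio_pci_write,    /* vendor-specific interception */
        .ioctl   = vfio_pci_core_ioctl,
        .mmap    = vfio_pci_core_mmap,
};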

And that's where turning vfio-pci into a vfio-pci-core library comes
in, lowering the bar for a vendor driver to re-use what vfio-pci has
already done. This doesn't change the basic vfio architecture that
already allows any vendor driver to register with its own
vfio_device_ops; it's just code-reuse, which only makes sense if we're
not simply shifting the support burden from the vendor driver to the
new library-ized core.

Like Kevin though, I don't really understand the hand-wave application
to mdev. Sure, vfio-mdev could be collapsed now that we've rejected the
idea that other drivers could bind to mdev devices, but mdev-core
provides common lifecycle management within a class of devices, so if
you subscribe to mdev, your vendor driver is essentially providing
vfio_device_ops plus the mdev extra callbacks. If your vendor driver
doesn't subscribe to that lifecycle, then just implement a vfio bus
driver with your own vfio_device_ops. The nature of that mdev
lifecycle doesn't seem to mesh well with a library that expects to
manage a whole physical PCI device though. Thanks,

Alex