RE: [PATCH RFC 00/15] Add VFIO mediated device support and IMS support for the idxd driver.

From: Tian, Kevin
Date: Thu Apr 23 2020 - 23:28:52 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxxxx>
> Sent: Friday, April 24, 2020 3:12 AM
>
> On Wed, Apr 22, 2020 at 02:14:36PM -0700, Raj, Ashok wrote:
> > Hi Jason
> >
> > > > >
> > > > > I'm feeling really skeptical that adding all this PCI config space and
> > > > > MMIO BAR emulation to the kernel just to cram this into a VFIO
> > > > > interface is a good idea; that kind of stuff is much safer in
> > > > > userspace.
> > > > >
> > > > > Particularly since vfio is not really needed once a driver is using
> > > > > the PASID stuff. We already have general code for drivers to use to
> > > > > attach a PASID to a mm_struct - and using vfio while disabling all the
> > > > > DMA/iommu config really seems like an abuse.
> > > >
> > > > Well, this series is for virtualizing the idxd device to VMs, instead
> > > > supporting SVA for bare metal processes. idxd implements a
> > > > hardware-assisted mediated device technique called Intel Scalable
> > > > I/O Virtualization,
> > >
> > > I'm familiar with the intel naming scheme.
> > >
> > > > which allows each Assignable Device Interface (ADI, e.g. a work
> > > > queue) to be tagged with a unique PASID to ensure fine-grained DMA
> > > > isolation when those ADIs are assigned to different VMs. For this
> > > > purpose idxd utilizes the VFIO mdev framework and the IOMMU
> > > > aux-domain extension. Bare metal SVA will be enabled for idxd later
> > > > by using the general SVA code that you mentioned. Both paths will
> > > > co-exist in the end, so there is no such case of disabling the
> > > > DMA/iommu config.
> > >
> > > Again, if you will have a normal SVA interface, there is no need for a
> > > VFIO version, just use normal SVA for both.
> > >
> > > PCI emulation should try to be in userspace, not the kernel, for
> > > security.
> >
> > Not sure we completely understand your proposal. Mediated devices
> > are software constructed and they have protected resources like
> > interrupts and stuff and VFIO already provides abstractions to export
> > to user space.
> >
> > Native SVA is simply passing the process CR3 handle to the IOMMU so
> > the IOMMU knows how to walk process page tables; the kernel handles
> > things like page faults, device TLB invalidations and such.
>
> > That by itself doesn't translate to what a guest typically does
> > with a VDEV. There are other control paths that need to be serviced
> > from the kernel code via VFIO. For speed path operations like
> > ringing doorbells and such, they are managed directly from the guest.
>
> You don't need vfio to mmap BAR pages to userspace. The unique thing
> that vfio gives is a way to program the classic non-PASID
> iommu, which you are not using here.

That unique thing is indeed used here. Please note that sharing the CPU
virtual address space with the device (what the SVA API was invented
for) is not the purpose of this series. We still rely on classic
non-PASID iommu programming, i.e. mapping/unmapping IOVA->HPA per
iommu_domain. Although we do use a PASID to tag each ADI, the PASID is
contained within the iommu_domain and invisible to VFIO. From the
userspace point of view, this is device passthrough rather than
PASID-based address space binding.
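
To make that concrete, here is a rough sketch (placeholder function,
error handling omitted, not the actual idxd code) of the aux-domain
flow: the mapping side stays the classic iommu_map() path that
VFIO_IOMMU_MAP_DMA drives today, while the PASID is allocated by the
IOMMU driver and never handed to userspace. Contrast this with
iommu_sva_bind_device(), which is the SVA path we are *not* taking:

  #include <linux/iommu.h>
  #include <linux/pci.h>

  static int adi_setup_dma(struct device *parent, dma_addr_t iova,
                           phys_addr_t hpa, size_t size)
  {
          struct iommu_domain *dom = iommu_domain_alloc(&pci_bus_type);
          int pasid;

          if (!dom)
                  return -ENOMEM;

          /* Classic non-PASID programming: map IOVA to HPA. */
          iommu_map(dom, iova, hpa, size, IOMMU_READ | IOMMU_WRITE);

          /* Aux-domain extension: attach the domain to the parent
           * device; the IOMMU driver picks a PASID to tag the ADI. */
          iommu_dev_enable_feature(parent, IOMMU_DEV_FEAT_AUX);
          iommu_aux_attach_device(dom, parent);

          /* The PASID stays in the kernel: it gets programmed into
           * the WQ configuration, never returned to userspace. */
          pasid = iommu_aux_get_pasid(dom, parent);

          return pasid;
  }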

>
> > How do you propose to use the existing SVA api's to also provide
> > full device emulation as opposed to using an existing infrastructure
> > that's already in place?
>
> You'd provide the 'full device emulation' in userspace (e.g. qemu),
> alongside all the other device emulation. Device emulation does not
> belong in the kernel without a very good reason.

The problem is that we are not doing full device emulation; it's
mediated passthrough. Some of the emulation logic requires close
engagement with the kernel device driver, e.g. resource allocation, WQ
configuration, fault reporting, etc., while the detailed interface is
very vendor/device specific (just like between a PF and a VF). idxd is
just the first device that supports Scalable IOV; many more of
different types are coming. Putting such emulation in user space would
mean that Qemu needs to support all those vendor-specific interfaces
for every new device which supports Scalable IOV. That contradicts our
goal of using Scalable IOV as an alternative to SR-IOV: for SR-IOV,
Qemu only needs to support one VFIO API and then any VF type simply
works. We want to sustain the same user experience through VFIO mdev.
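
Just to illustrate the pattern (all "my_*" names are invented; this is
a skeleton, not the real idxd code): the vendor-specific logic hides
behind the generic mdev callbacks, so Qemu keeps driving the one
standard VFIO API no matter which device sits underneath.

  #include <linux/mdev.h>
  #include <linux/module.h>

  static int my_create(struct kobject *kobj, struct mdev_device *mdev)
  {
          /* Vendor-specific: allocate a WQ/ADI on the parent device,
           * set up its aux domain / PASID, etc. */
          return 0;
  }

  static int my_remove(struct mdev_device *mdev)
  {
          /* Vendor-specific teardown. */
          return 0;
  }

  static ssize_t my_read(struct mdev_device *mdev, char __user *buf,
                         size_t count, loff_t *ppos)
  {
          /* Emulated PCI config space / MMIO reads land here. */
          return count;
  }

  static ssize_t my_write(struct mdev_device *mdev,
                          const char __user *buf,
                          size_t count, loff_t *ppos)
  {
          /* Writes that need the parent driver (WQ configuration,
           * resource allocation, ...) land here. */
          return count;
  }

  static const struct mdev_parent_ops my_mdev_ops = {
          .owner  = THIS_MODULE,
          .create = my_create,
          .remove = my_remove,
          .read   = my_read,
          .write  = my_write,
          /* .ioctl/.mmap (VFIO_DEVICE_* and the direct-mapped
           * doorbell page) and .supported_type_groups omitted. */
  };

  /* Registered once by the parent (PF) driver:
   *     mdev_register_device(dev, &my_mdev_ops);
   */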

Specifically for PCI config space emulation, it is already done in
multiple kernel places today, e.g. vfio-pci, kvmgt, etc. We do plan to
consolidate them later.
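
A toy model of that config-space pattern (struct and names invented
for illustration): reads are served from a per-device shadow copy, so
the guest never touches the parent's real registers.

  #include <linux/pci.h>
  #include <linux/uaccess.h>

  struct my_vcfg {
          u8 shadow[PCI_CFG_SPACE_EXP_SIZE];   /* 4K shadow config */
  };

  static ssize_t my_cfg_read(struct my_vcfg *vcfg, char __user *buf,
                             size_t count, loff_t pos)
  {
          if (pos + count > sizeof(vcfg->shadow))
                  return -EINVAL;
          if (copy_to_user(buf, vcfg->shadow + pos, count))
                  return -EFAULT;
          return count;
  }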

>
> You get the doorbell BAR page from your own char dev
>
> You setup a PASID IOMMU configuration over your own char dev
>
> Interrupt delivery is triggering a generic event fd
>
> What is VFIO needed for?

Based on the above explanation, VFIO mdev already meets all of our
requirements, so why bother inventing a new interface...

>
> > Perhaps Alex can ease Jason's concerns?
>
> Last we talked Alex also had doubts on what mdev should be used
> for. It is a feature that seems to lack boundaries, and I'll note that
> when the discussion came up for VDPA, they eventually chose not to
> use VFIO.
>

Is there a link to Alex's doubt? I'm not sure why vDPA didn't go for
VFIO, but imho it is a different story. vDPA is specifically for
devices which implement the standard vhost/virtio interface, so it's
reasonable that inventing a new mechanism could be more efficient for
all vDPA-type devices. Scalable IOV, however, is similar to SR-IOV: it
only does resource partitioning and doesn't change the device
programming interface, which can take any vendor-specific form. Here
VFIO mdev is good for providing a unified interface for managing
resource multiplexing of all such devices.

Thanks
Kevin