Re: [RFC] /dev/ioasid uAPI proposal

From: Alex Williamson
Date: Fri Jun 04 2021 - 17:29:29 EST


On Fri, 4 Jun 2021 14:22:07 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:

> On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > >
> > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > to gain extra privilege.
> > >
> > > Okay, fine, let's turn the question on its head then.
> > >
> > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > application can make use of no-snoop optimizations. The ability of KVM
> > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > in a normal process context.
> > >
> > > So, under what conditions do we want to allow VFIO to give a process
> > > elevated access to the CPU:
> >
> > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > which then would be on VFIO and not on KVM.
>
> At the end of the day we need an ioctl with two arguments:
> - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> - The KVM FD to control wbinvd support on
>
> Philosophically it doesn't matter too much which subsystem that ioctl
> lives in, but we have these obnoxious cross module dependencies to
> consider..
>
> Framing the question, as you have, to be about the process, I think
> explains why KVM doesn't really care what is decided, so long as the
> process and the VM have equivalent rights.
>
> Alex, how about a more fleshed out suggestion:
>
> 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> it communicates its no-snoop configuration:

Communicates to whom?

> - 0 enable, allow WBINVD
> - 1 automatic disable, block WBINVD if the platform
> IOMMU can police it (what we do today)
> - 2 force disable, do not allow WBINVD ever

The only thing we know about the device is whether or not Enable
No-snoop is hard wired to zero, ie. it either can't generate no-snoop
TLPs ("coherent-only") or it might ("assumed non-coherent"). If
we're putting the policy decision in the hands of userspace they should
have access to wbinvd if they own a device that is assumed
non-coherent AND it's attached to an IOMMU (page table) that is not
blocking no-snoop (a "non-coherent IOASID").

I think that means that the IOASID needs to be created (IOASID_ALLOC)
with a flag that specifies whether this address space is coherent
(IOASID_GET_INFO probably needs a flag/cap to expose if the system
supports this). All mappings in this IOASID would use IOMMU_CACHE and
devices attached to it would be required to be backed by an IOMMU
capable of IOMMU_CAP_CACHE_COHERENCY (attach fails otherwise). If only
these IOASIDs exist, access to wbinvd would not be provided. (How does
a user provided page table work? - reserved bit set, user error?)

Conversely, a user could create a non-coherent IOASID and attach any
device to it, regardless of IOMMU backing capabilities. Only if an
assumed non-coherent device is attached would the wbinvd be allowed.
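
To make the attach rules above concrete, here is a minimal userspace
sketch of the model, not kernel code. Every name in it (sketch_ioasid,
sketch_device, sketch_attach, sketch_wbinvd_allowed) is hypothetical,
and the -EINVAL return for a failed attach is an assumption; the
proposal only says that the attach fails.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of an IOASID; none of these names exist in the kernel. */
struct sketch_ioasid {
	bool coherent;                 /* requested at allocation time */
	unsigned int noncoherent_devs; /* attached "assumed non-coherent" devices */
};

/* Hypothetical model of what is known about a device at attach time. */
struct sketch_device {
	bool coherent_only;         /* Enable No-snoop is hard wired to zero */
	bool iommu_cache_coherency; /* backing IOMMU can enforce coherency */
};

/* Attach a device, applying the rules described in the text above. */
static int sketch_attach(struct sketch_ioasid *as,
			 const struct sketch_device *dev)
{
	if (as->coherent) {
		/* A coherent IOASID requires an IOMMU that can enforce it. */
		if (!dev->iommu_cache_coherency)
			return -EINVAL; /* assumed errno */
		return 0;
	}

	/* A non-coherent IOASID accepts any device. */
	if (!dev->coherent_only)
		as->noncoherent_devs++;
	return 0;
}

/* wbinvd is only justified by an assumed non-coherent device
 * attached to a non-coherent IOASID. */
static bool sketch_wbinvd_allowed(const struct sketch_ioasid *as)
{
	return !as->coherent && as->noncoherent_devs > 0;
}
```

In this sketch a coherent IOASID never grants wbinvd, whatever is
attached to it, and a non-coherent IOASID grants it only once an
assumed non-coherent device shows up.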

I think that means that an EXECUTE_WBINVD ioctl lives on the IOASIDFD
and the IOASID world needs to understand the device's ability to
generate non-coherent DMA. This wbinvd ioctl would be a no-op (or
some known errno) unless a non-coherent IOASID exists with a potentially
non-coherent device attached.
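
As a rough illustration of those ioctl semantics, the handler could
scan the IOASIDs behind the fd and only flush when one of them
qualifies. This is a userspace sketch under stated assumptions: the
types and the function sketch_execute_wbinvd are hypothetical, the
flush is replaced by a counter, and returning -EPERM rather than
silently doing nothing is just one of the two options mentioned above.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-IOASID state tracked behind an ioasid fd. */
struct fd_ioasid {
	bool coherent;            /* allocated as a coherent address space */
	bool has_noncoherent_dev; /* an assumed non-coherent device is attached */
};

/* Hypothetical per-fd state. */
struct ioasid_fd {
	const struct fd_ioasid *ioasids;
	size_t count;
	unsigned long flushes; /* stand-in for actually executing wbinvd */
};

/* Sketch of an EXECUTE_WBINVD handler living on the ioasid fd. */
static int sketch_execute_wbinvd(struct ioasid_fd *fd)
{
	for (size_t i = 0; i < fd->count; i++) {
		const struct fd_ioasid *as = &fd->ioasids[i];

		if (!as->coherent && as->has_noncoherent_dev) {
			fd->flushes++; /* the real thing would flush caches here */
			return 0;
		}
	}

	/* No non-coherent IOASID with a potentially non-coherent device. */
	return -EPERM; /* assumed errno; a plain no-op is the alternative */
}
```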

> vfio_pci may want to take this from an admin configuration knob
> someplace. It allows the admin to customize if they want.
>
> If we can figure out a way to autodetect 2 from vfio_pci, all the
> better
>
> 2) There is some IOMMU_EXECUTE_WBINVD IOCTL that allows userspace
> to access wbinvd so it can make use of the no snoop optimization.
>
> wbinvd is allowed when:
> - A device is joined with mode #0
> - A device is joined with mode #1 and the IOMMU cannot block
> no-snoop (today)
>
> 3) The IOASIDs don't care about this at all. If IOMMU_EXECUTE_WBINVD
> is blocked and userspace doesn't request to block no-snoop in the
> IOASID then it is a userspace error.

In my model above, the IOASID is central to this.

> 4) The KVM interface is the very simple enable/disable WBINVD.
> Possessing a FD that can do IOMMU_EXECUTE_WBINVD is required
> to enable WBINVD at KVM.

Right, and in the new world order, vfio is only a device driver; the
IOASID manages the device's DMA. wbinvd is only necessary relative to
non-coherent DMA, so it seems like QEMU needs to bump KVM with an
ioasidfd.

> It is pretty simple from a /dev/ioasid perspective, covers today's
> compat requirement, gives some future option to allow the no-snoop
> optimization, and gives a new option for qemu to totally block wbinvd
> no matter what.

What do you imagine is the use case for totally blocking wbinvd? In
the model I describe, wbinvd would always be a no-op/known-errno when
the IOASIDs are all allocated as coherent or a non-coherent IOASID has
only coherent-only devices attached. Does userspace need a way to
prevent itself from scenarios where wbinvd is not a no-op?

In general I'm having trouble wrapping my brain around the semantics of
the enable/automatic/force-disable wbinvd specific proposal, sorry.
Thanks,

Alex