Re: [RFC] /dev/ioasid uAPI proposal
From: Alex Williamson
Date: Mon Jun 07 2021 - 11:42:01 EST
On Fri, 4 Jun 2021 20:01:08 -0300
Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> On Fri, Jun 04, 2021 at 03:29:18PM -0600, Alex Williamson wrote:
> > On Fri, 4 Jun 2021 14:22:07 -0300
> > Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
> >
> > > On Fri, Jun 04, 2021 at 06:10:51PM +0200, Paolo Bonzini wrote:
> > > > On 04/06/21 18:03, Jason Gunthorpe wrote:
> > > > > On Fri, Jun 04, 2021 at 05:57:19PM +0200, Paolo Bonzini wrote:
> > > > > > I don't want a security proof myself; I want to trust VFIO to make the right
> > > > > > judgment and I'm happy to defer to it (via the KVM-VFIO device).
> > > > > >
> > > > > > Given how KVM is just a device driver inside Linux, VMs should be a slightly
> > > > > > more roundabout way to do stuff that is accessible to bare metal; not a way
> > > > > > to gain extra privilege.
> > > > >
> > > > > > Okay, fine, let's turn the question on its head then.
> > > > >
> > > > > > VFIO should provide an IOCTL VFIO_EXECUTE_WBINVD so that userspace VFIO
> > > > > application can make use of no-snoop optimizations. The ability of KVM
> > > > > to execute wbinvd should be tied to the ability of that IOCTL to run
> > > > > in a normal process context.
> > > > >
> > > > > > So, under what conditions do we want to allow VFIO to give a process
> > > > > elevated access to the CPU:
> > > >
> > > > Ok, I would definitely not want to tie it *only* to CAP_SYS_RAWIO (i.e.
> > > > #2+#3 would be worse than what we have today), but IIUC the proposal (was it
> > > > yours or Kevin's?) was to keep #2 and add #1 with an enable/disable ioctl,
> > > > which then would be on VFIO and not on KVM.
> > >
> > > At the end of the day we need an ioctl with two arguments:
> > > - The 'security proof' FD (ie /dev/vfio/XX, or /dev/ioasid, or whatever)
> > > - The KVM FD to control wbinvd support on
> > >
> > > > Philosophically it doesn't matter too much which subsystem that ioctl
> > > > lives in, but we have these obnoxious cross module dependencies to
> > > > consider..
> > >
> > > Framing the question, as you have, to be about the process, I think
> > > explains why KVM doesn't really care what is decided, so long as the
> > > process and the VM have equivalent rights.
> > >
> > > Alex, how about a more fleshed out suggestion:
> > >
> > > 1) When the device is attached to the IOASID via VFIO_ATTACH_IOASID
> > > it communicates its no-snoop configuration:
> >
> > Communicates to whom?
>
> To the /dev/iommu FD which will have to maintain a list of devices
> attached to it internally.
>
> > > - 0 enable, allow WBINVD
> > > - 1 automatic disable, block WBINVD if the platform
> > > IOMMU can police it (what we do today)
> > > - 2 force disable, do not allow WBINVD ever
> >
> > The only thing we know about the device is whether or not Enable
> > No-snoop is hard wired to zero, ie. it either can't generate no-snoop
> > TLPs ("coherent-only") or it might ("assumed non-coherent").
>
> Here I am outlining the choice and also imagining we might want an
> admin knob to select the three.

You're calling this an admin knob, which to me suggests a global module
option, so are you trying to implement both an administrator and a user
policy? ie. the user can create scenarios where access to wbinvd might
be justified by hardware/IOMMU configuration, but can be limited by the
admin?

For example I proposed that the ioasidfd would bear the responsibility
of a wbinvd ioctl and therefore validate the user's access to enable
wbinvd emulation w/ KVM, so I'm assuming this module option lives
there. I essentially described the "enable" behavior in my previous
reply: the user has access to wbinvd if owning a non-coherent capable
device managed in a non-coherent IOASID. Yes, the user IOASID
configuration controls the latter half of this.

What then is "automatic" mode? The user cannot create a non-coherent
IOASID with a non-coherent device if the IOMMU supports no-snoop
blocking? Do they get a failure? Does it get silently promoted to
coherent?

In "disable" mode, I think we're just narrowing the restriction
further, a non-coherent capable device cannot be used except in a
forced coherent IOASID.
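
To make the three modes a bit more concrete, here is a minimal sketch of one way they could be read. It is only an illustration under stated assumptions, not a proposed implementation: the enum, the struct and wbinvd_allowed() are all hypothetical names, and the behavior picked for "automatic" mode (block wbinvd whenever the IOMMU can police no-snoop) is just one possible answer to the open questions above.

```c
#include <stdbool.h>

/* Hypothetical admin policy values, mirroring the quoted 0/1/2 list. */
enum wbinvd_policy {
	WBINVD_POLICY_ENABLE = 0,        /* allow wbinvd */
	WBINVD_POLICY_AUTO_DISABLE = 1,  /* block if the IOMMU can police no-snoop */
	WBINVD_POLICY_FORCE_DISABLE = 2, /* never allow wbinvd */
};

/* Hypothetical per-attachment state; none of these fields exist today. */
struct ioasid_attach {
	bool dev_may_nosnoop;         /* Enable No-snoop is not hardwired to zero */
	bool ioasid_noncoherent;      /* user asked for a non-coherent IOASID */
	bool iommu_can_block_nosnoop; /* platform IOMMU can enforce snooping */
};

static bool wbinvd_allowed(enum wbinvd_policy policy,
			   const struct ioasid_attach *a)
{
	/* A coherent-only device or a coherent IOASID never justifies wbinvd. */
	if (!a->dev_may_nosnoop || !a->ioasid_noncoherent)
		return false;

	switch (policy) {
	case WBINVD_POLICY_ENABLE:
		return true;
	case WBINVD_POLICY_AUTO_DISABLE:
		/* Assumption: block whenever the IOMMU is able to police it. */
		return !a->iommu_can_block_nosnoop;
	case WBINVD_POLICY_FORCE_DISABLE:
	default:
		return false;
	}
}
```

Whether a disallowed wbinvd should fail or become a no-op is deliberately left out of the sketch, since that is one of the points still under discussion.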
> > If we're putting the policy decision in the hands of userspace they
> > should have access to wbinvd if they own a device that is assumed
> > non-coherent AND it's attached to an IOMMU (page table) that is not
> > blocking no-snoop (a "non-coherent IOASID").
>
> There are two parts here, like Paolo was leading to: whether the process
> has access to WBINVD, and then whether such an allowed process tells KVM to
> turn on WBINVD in the guest.
>
> If the process has a device and it has a way to create a non-coherent
> IOASID, then that process has access to WBINVD.
>
> For security it doesn't matter if the process actually creates the
> non-coherent IOASID or not. An attacker will simply do the steps that
> give access to WBINVD.

Yes, at this point the user has the ability to create a configuration
where they could have access to wbinvd, but if they haven't created
such a configuration, is the wbinvd a no-op?

> The important detail is that access to WBINVD does not compel the
> process to tell KVM to turn on WBINVD. So a qemu with access to WBINVD
> can still choose to create a secure guest by always using IOMMU_CACHE
> in its page tables and not asking KVM to enable WBINVD.

Of course.

> This proposal shifts this policy decision from the kernel to userspace.
> qemu is responsible to determine if KVM should enable wbinvd or not
> based on whether it was able to create IOASIDs with IOMMU_CACHE.

QEMU is responsible for making sure the VM is consistent; if
non-coherent DMA can occur, wbinvd is emulated. But it's still the
KVM/IOASID connection that validates that access.
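
As a rough illustration of the userspace half of that split, the consistency check could look something like the sketch below. The struct and vm_needs_wbinvd() are hypothetical names invented for this example and are not QEMU code; the only rule encoded is the one stated above, that wbinvd emulation is wanted when non-coherent DMA can occur. The kernel-side validation of that request is not shown.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace view of one assigned device. */
struct vm_dev_attach {
	bool dev_may_nosnoop;    /* device might generate no-snoop TLPs */
	bool ioasid_noncoherent; /* attached to a non-coherent IOASID */
};

/*
 * Returns true if non-coherent DMA can occur for any assigned device,
 * in which case the VMM would ask for wbinvd emulation.
 */
static bool vm_needs_wbinvd(const struct vm_dev_attach *devs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i].dev_may_nosnoop && devs[i].ioasid_noncoherent)
			return true;
	}
	return false;
}
```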
> > Conversely, a user could create a non-coherent IOASID and attach any
> > device to it, regardless of IOMMU backing capabilities. Only if an
> > assumed non-coherent device is attached would the wbinvd be allowed.
>
> Right, this is exactly the point. Since the user gets to pick if the
> IOASID is coherent or not then an attacker can always reach WBINVD
> using only the device FD. Additional checks don't add to the security
> of the process.
>
> The additional checks you are describing add to the security of the
> guest, however qemu is capable of doing them without more help from the
> kernel.
>
> It is the strength of Paolo's model that KVM should be able to do
> optionally less, but not more, than the process itself can do.

I think my previous reply was working towards those guidelines. I feel
like we're mostly in agreement, but perhaps reading past each other.

Nothing here convinced me against my previous proposal that the
ioasidfd bears responsibility for managing access to a wbinvd ioctl,
and therefore the equivalent KVM access. Whether wbinvd is allowed or
no-op'd when the user has access to a non-coherent device in a
configuration where the IOMMU prevents non-coherent DMA is maybe still
a matter of personal preference.

> > > It is pretty simple from a /dev/ioasid perspective, covers today's
> > > compat requirement, gives some future option to allow the no-snoop
> > > optimization, and gives a new option for qemu to totally block wbinvd
> > > no matter what.
> >
> > What do you imagine is the use case for totally blocking wbinvd?
>
> If wbinvd is really security important then an operator should endeavor
> to turn it off. It can be safely turned off if the operator
> understands the SRIOV devices they are using. ie if you are only using
> mlx5 or an nvme then force it off and be secure, regardless of the
> platform capability.

Ok, I'm not opposed to something like a module option that restricts to
only coherent DMA, but we need to work through how that's exposed and
the userspace behavior. The most obvious would be that a GET_INFO
ioctl on the ioasidfd indicates the restrictions, a flag on the IOASID
alloc indicates the coherency of the IOASID, and we fail any cases
where the admin policy or hardware support doesn't match (ie. alloc if
it's incompatible with policy, attach if the device/IOMMU backing
violates policy). This is all a compatible layer with what I described
previously. Thanks,
Alex
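
A minimal sketch of the alloc/attach checks described in the last paragraph, under loudly stated assumptions: struct ioasid_fd_info, IOASID_ALLOC_NONCOHERENT and both helper functions are hypothetical and exist in no kernel tree, and the choice of -EPERM and -EINVAL is purely illustrative. The idea shown is only that alloc fails when the requested coherency conflicts with admin policy, and attach fails when the device/IOMMU backing cannot honor a coherent IOASID.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical restriction reported by a GET_INFO-style ioctl. */
struct ioasid_fd_info {
	bool coherent_only; /* admin policy: only coherent DMA permitted */
};

/* Hypothetical alloc flag selecting a non-coherent IOASID. */
#define IOASID_ALLOC_NONCOHERENT (1u << 0)

/* Alloc fails if the requested coherency is incompatible with policy. */
static int ioasid_alloc_check(const struct ioasid_fd_info *info,
			      unsigned int flags)
{
	if (info->coherent_only && (flags & IOASID_ALLOC_NONCOHERENT))
		return -EPERM;
	return 0;
}

/* Attach fails if the device/IOMMU backing cannot satisfy the IOASID. */
static int ioasid_attach_check(unsigned int ioasid_flags,
			       bool dev_may_nosnoop,
			       bool iommu_can_block_nosnoop)
{
	/* A non-coherent IOASID accepts any device. */
	if (ioasid_flags & IOASID_ALLOC_NONCOHERENT)
		return 0;

	/* A coherent IOASID needs the IOMMU to block no-snoop for this device. */
	if (dev_may_nosnoop && !iommu_can_block_nosnoop)
		return -EINVAL;

	return 0;
}
```

With a coherent-only policy, a request for a non-coherent IOASID is rejected at alloc time, while a coherent IOASID still rejects a non-coherent capable device at attach time if the IOMMU cannot enforce snooping.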