RE: [RFC 10/20] iommu/iommufd: Add IOMMU_DEVICE_GET_INFO
From: Tian, Kevin
Date: Wed Oct 20 2021 - 22:26:14 EST
> From: Tian, Kevin
> Sent: Friday, October 15, 2021 9:02 AM
>
> > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > Sent: Thursday, October 14, 2021 11:43 PM
> >
> > > > > I think the key is whether other archs allow driver to decide DMA
> > > > > coherency and indirectly the underlying I/O page table format.
> > > > > If yes, then I don't see a reason why such decision should not be
> > > > > given to userspace for passthrough case.
> > > >
> > > > The choice all comes down to if the other arches have cache
> > > > maintenance instructions in the VM that *don't work*
> > >
> > > Looks vfio always sets IOMMU_CACHE on all platforms as long as
> > > iommu supports it (true on all platforms except intel iommu which
> > > is dedicated for GPU):
> > >
> > > vfio_iommu_type1_attach_group()
> > > {
> > > ...
> > > if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
> > > domain->prot |= IOMMU_CACHE;
> > > ...
> > > }
> > >
> > > Should above be set according to whether a device is coherent?
> >
> > For IOMMU_CACHE there are two questions related to the overloaded
> > meaning:
> >
> > - Should VFIO ask the IOMMU to use non-coherent DMA (ARM meaning)
> > This depends on how the VFIO user expects to operate the DMA.
> > If the VFIO user can issue cache maintenance ops then IOMMU_CACHE
> > should be controlled by the user. I have no idea what platforms
> > support user space cache maintenance ops.
>
> But just like you said for intel meaning below, even if those ops are
> privileged a uAPI can be provided to support such usage if necessary.
>
> >
> > - Should VFIO ask the IOMMU to suppress no-snoop (Intel meaning)
> > This depends if the VFIO user has access to wbinvd or not.
> >
> > wbinvd is a privileged instruction so normally userspace will not
> > be able to access it.
> >
> > Per Paolo recommendation there should be a uAPI someplace that
> > allows userspace to issue wbinvd - basically the suppress no-snoop
> > is also user controllable.
> >
> > The two things are very similar and ultimately are a choice userspace
> > should be making.
>
> yes
>
> >
> > From something like a qemu perspective things are more murkey - eg on
> > ARM qemu needs to co-ordinate with the guest. Whatever IOMMU_CACHE
> > mode VFIO is using must match the device coherent flag in the Linux
> > guest. I'm guessing all Linux guest VMs only use coherent DMA for all
> > devices today. I don't know if the cache maintaince ops are even
> > permitted in an ARM VM.
> >
>
> I'll leave it to Jean to confirm. If only coherent DMA can be used in
> the guest on other platforms, suppose VFIO should not blindly set
> IOMMU_CACHE and in concept it should deny assigning a non-coherent
> device since no co-ordination with guest exists today.
Jean, what's your opinion?
>
> So the bottomline is that we'll keep this no-snoop thing Intel-specific.
> For the basic skeleton we'll not support no-snoop thus the user
> needs to set enforce-snoop flag when creating an IOAS like this RFC v1
> does. Also need to introduce a new flag instead of abusing
> IOMMU_CACHE in the kernel. For other platforms it may need a fix
> to deny non-coherent device (based on above open) for now.
>
Jason, want to check whether another option works here.
The current proposal lets the user to choose whether the I/O page
table should be put in an enforced-snoop format, with the assumption
that the user may have better knowledge than the kernel to know the
no-snoop requirement. This leads to the current design which exposes
whether an IOMMU behind a device supports enforce-snoop via
IOMMU_DEVICE_GET_INFO to the user and then have the user to
set/clear the enforce-snoop flag in IOMMU_IOASID_ALLOC.
This makes sense if there are no-snoop devices behind an IOMMU
supporting enforce-snoop.
But in reality only Intel integrated GPUs have this special no-snoop
trick (fixed knowledge), with a dedicated IOMMU which doesn't
support enforce-snoop format at all. In this case there is no choice
that the user can further make.
Also per Christoph's comment no-snoop is not an encouraged
usage overall.
Given that I wonder whether the current vfio model better suites
for this corner case, i.e. just let the kernel to handle instead of
exposing it in uAPI. The simple policy (as vfio does) is to automatically
set enforce-snoop when the target IOMMU supports it, otherwise
enable vfio/kvm contract to handle no-snoop requirement.
I don't see any interest in implementing an Intel GPU driver fully
in userspace. If just talking about possibility, a separate uAPI can
be still introduced to allow the userspace to issue wbinvd as Paolo
suggested.
One side-effect of doing so is that then we may have to support
multiple domains per IOAS when Intel GPU and other devices are
attached to the same IOAS. But this doesn't have to be implemented
in the basic skeleton now. Can be extended later when we start
working on Intel GPU support. And overall it also improves
performance otherwise the user has to create two duplicated IOAS's
(one for GPU, one for other devices) if assuming one domain per
IOAS then every map request must be done twice in both IOAS's.
Does this option make sense?
btw fixing the abuse of IOMMU_CACHE is orthogonal to this uAPI
open anyway.
Thanks
Kevin