Re: [RFC] /dev/ioasid uAPI proposal

From: Jason Gunthorpe
Date: Thu Jun 03 2021 - 08:34:08 EST


On Wed, Jun 02, 2021 at 08:50:54PM -0600, Alex Williamson wrote:
> On Wed, 2 Jun 2021 19:45:36 -0300
> Jason Gunthorpe <jgg@xxxxxxxxxx> wrote:
>
> > On Wed, Jun 02, 2021 at 02:37:34PM -0600, Alex Williamson wrote:
> >
> > > Right. I don't follow where you're jumping to relaying DMA_PTE_SNP
> > > from the guest page table... what page table?
> >
> > I see my confusion now, the phrasing in your earlier remark led me
> > think this was about allowing the no-snoop performance enhancement in
> > some restricted way.
> >
> > It is really about blocking no-snoop 100% of the time and then
> > disabling the dangerous wbinvd when the block is successful.
> >
> > Didn't closely read the kvm code :\
> >
> > If it was about allowing the optimization then I'd expect the guest to
> > enable no-snoopable regions via it's vIOMMU and realize them to the
> > hypervisor and plumb the whole thing through. Hence my remark about
> > the guest page tables..
> >
> > So really the test is just 'were we able to block it' ?
>
> Yup. Do we really still consider that there's some performance benefit
> to be had by enabling a device to use no-snoop? This seems largely a
> legacy thing.

I've recently had some no-snoopy discussions lately.. The issue didn't
vanish, it is still expensive going through all that cache hardware.

> > But Ok, back the /dev/ioasid. This answers a few lingering questions I
> > had..
> >
> > 1) Mixing IOMMU_CAP_CACHE_COHERENCY and !IOMMU_CAP_CACHE_COHERENCY
> > domains.
> >
> > This doesn't actually matter. If you mix them together then kvm
> > will turn on wbinvd anyhow, so we don't need to use the DMA_PTE_SNP
> > anywhere in this VM.
> >
> > This if two IOMMU's are joined together into a single /dev/ioasid
> > then we can just make them both pretend to be
> > !IOMMU_CAP_CACHE_COHERENCY and both not set IOMMU_CACHE.
>
> Yes and no. Yes, if any domain is !IOMMU_CAP_CACHE_COHERENCY then we
> need to emulate wbinvd, but no we'll use IOMMU_CACHE any time it's
> available based on the per domain support available. That gives us the
> most consistent behavior, ie. we don't have VMs emulating wbinvd
> because they used to have a device attached where the domain required
> it and we can't atomically remap with new flags to perform the same as
> a VM that never had that device attached in the first place.

I think we are saying the same thing..

> > 2) How to fit this part of kvm in some new /dev/ioasid world
> >
> > What we want to do here is iterate over every ioasid associated
> > with the group fd that is passed into kvm.
>
> Yeah, we need some better names, binding a device to an ioasid (fd) but
> then attaching a device to an allocated ioasid (non-fd)... I assume
> you're talking about the latter ioasid.

Fingers crossed on RFCv2.. Here I mean the IOASID object inside the
/dev/iommu FD. The vfio_device would have some kref handle to the
in-kernel representation of it. So we can interact with it..

> > Or perhaps more directly: an op attaching the vfio_device to the
> > kvm and having some simple helper
> > '(un)register ioasid with kvm (kvm, ioasid)'
> > that the vfio_device driver can call that just sorts this out.
>
> We could almost eliminate the device notion altogether here, use an
> ioasidfd_for_each_ioasid() but we really want a way to trigger on each
> change to the composition of the device set for the ioasid, which is
> why we currently do it on addition or removal of a group, where the
> group has a consistent set of IOMMU properties.

That is another quite good option, just forget about trying to be
highly specific and feed in the /dev/ioasid FD and have kvm ask "does
anything in here not enforce snoop?"

With something appropriate to track/block changing that answer.

It doesn't solve the problem to connect kvm to AP and kvmgt though

Jason