RE: [PATCH v2 02/10] iommu: Introduce a new iommu_group_replace_domain() API
From: Tian, Kevin
Date: Sun Feb 12 2023 - 21:24:52 EST
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, February 11, 2023 8:45 AM
>
> On Fri, Feb 10, 2023 at 04:51:10PM -0700, Alex Williamson wrote:
> > On Tue, 7 Feb 2023 13:17:54 -0800
> > Nicolin Chen <nicolinc@xxxxxxxxxx> wrote:
> >
> > > qemu has a need to replace the translations associated with a domain
> > > when the guest does large-scale operations like switching between an
> > > IDENTITY domain and, say, dma-iommu.c.
> > >
> > > Currently, it does this by replacing all the mappings in a single
> > > domain, but this is very inefficient and means that domains have to be
> > > per-device rather than per-translation.
> > >
> > > Provide a high-level API to allow replacements of one domain with
> > > another. This is similar to a detach/attach cycle except it doesn't
> > > force the group to go to the blocking domain in-between.
> > >
> > > By removing this forced blocking domain the iommu driver has the
> > > opportunity to implement an atomic replacement of the domains to the
> > > greatest extent its hardware allows.
> > >
> > > It could be possible to adderss this by simply removing the protection
> > > from the iommu_attach_group(), but it is not so clear if that is safe
> > > for the few users. Thus, add a new API to serve this new purpose.
> > >
> > > Atomic replacement allows the qemu emulation of the viommu to be
> more
> > > complete, as real hardware has this ability.
> >
> > I was under the impression that we could not atomically switch a
> > device's domain relative to in-flight DMA.
it's possible as long as the mappings for in-flight DMA don't change
in the transition.
>
> Certainly all the HW can be proper atomic but not necessarily easily -
> the usual issue is a SW complication to manage the software controlled
> cache tags in a way that doesn't corrupt the cache.
>
> This is because the cache tag and the io page table top are in
> different 64 bit words so atomic writes don't cover both, and thus the
> IOMMU HW could tear the two stores and mismatch the cache tag to the
> table top. This would corrupt the cache.
VT-d spec recommends using 128bit cmpxchg instruction to update
page table pointer and DID together.
>
> The easiest way to avoid this is for SW to use the same DID for the
> new and old tables. This is possible if this translation entry is the
> only user of the DID. A more complex way would use a temporary DID
> that can be safely corrupted. But realistically I'd expect VT-d
> drivers to simply make the PASID invalid for the duration of the
> update.
>
> However something like AMD has a single cache tag for every entry in
> the PASID table so you could do an atomic replace trivially. Just
> update the PASID and invalidate the caches.
>
> ARM has a flexible PASID table and atomic replace would be part of
> resizing it. eg you can atomically update the top of the PASID table
> with a single 64 bit store.
>
> So replace lets the drivers implement those special behaviors if it
> makes sense for them.
>
> > Or maybe atomic is the wrong word here since we expect no in-flight DMA
> > during the sort of mode transitions referred to here, and we're really
> > just trying to convey that we can do this via a single operation with
> > reduced latency? Thanks,
>
> atomic means DMA will either translate with the old domain or the new
> domain but never a blocking domain. Keep in mind that with nesting
> "domain" can mean a full PASID table in guest memory.
>
> I should reiterate that replace is not an API that is required to be
> atomic.
>
yes. Just as explained in the commit message:
"
By removing this forced blocking domain the iommu driver has the
opportunity to implement an atomic replacement of the domains to the
greatest extent its hardware allows.
"
API level replace only implies removing transition to/from blocking
domain. and then it's driver's call whether it wants to take this
chance to implement a true 'atomic' behavior.