Re: [PATCH v1 02/14] iommufd: Add nesting related data structures for ARM SMMUv3

From: Jean-Philippe Brucker
Date: Thu Mar 09 2023 - 13:27:06 EST


On Thu, Mar 09, 2023 at 10:48:50AM -0400, Jason Gunthorpe wrote:
> On Thu, Mar 09, 2023 at 01:42:17PM +0000, Jean-Philippe Brucker wrote:
>
> > Although we can keep the alloc and hardware info separate for each IOMMU
> > architecture, we should try to come up with common invalidation methods.
>
> The invalidation language is tightly linked to the actual cache block
> and cache tag in the IOMMU HW design.

Concretely though, what are the incompatibilities between the HW designs?
They all need to remove a range of TLB entries, using some address space
tag. But if there is an actual difference I do need to know.
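
For illustration, the common denominator could be as small as this
(a rough sketch; none of these names exist in iommufd today):

#include <linux/types.h>

/* Generic invalidation request: an address range plus a tag naming the
 * address space it applies to. */
struct iommu_cache_invalidate {
	__u32	flags;		/* e.g. leaf-only hint */
	__u32	id;		/* address space tag: ASID/PASID/DID */
	__u64	addr;		/* start of the range to invalidate */
	__u64	size;		/* length in bytes, ~0ULL for everything */
};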

> Generality will lose or
> obfuscate the necessary specificity that is required for creating real
> vIOMMUs.
>
> Further, invalidation is a fast path; it is crazy to take a vIOMMU of
> real HW receiving a native invalidation request, mangle it into some
> obfuscated kernel version and then de-mangle it again in the kernel
> driver. IMHO ideally QEMU will simply point the invalidation at the
> WQE in the SW vIOMMU command queue and invoke the ioctl. (Nicolin, we
> should check more into this)

Avoiding the copy of a few bytes won't make up for the extra context
switches to userspace. An emulated IOMMU can easily decode commands and
translate them to generic kernel structures in a handful of CPU cycles,
just like it decodes STEs. That is what emulated IOMMUs do, and it's the
opposite of obfuscation.
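
For example, decoding CMD_TLBI_NH_VA into the generic structure
sketched above takes a couple of shifts and masks (again a sketch: the
flag name is invented, the bit positions are from the SMMUv3 spec):

#include <stdint.h>

#define INV_FLAG_LEAF	(1 << 0)	/* invented flag name */

/* Decode CMD_TLBI_NH_VA (two 64-bit command words): the ASID is in
 * cmd[0] bits [63:48], the address in cmd[1] bits [63:12] and the
 * Leaf hint in cmd[1] bit 0. */
static void decode_tlbi_nh_va(const uint64_t cmd[2],
			      struct iommu_cache_invalidate *inv)
{
	inv->flags = (cmd[1] & 1) ? INV_FLAG_LEAF : 0;
	inv->id    = cmd[0] >> 48;
	inv->addr  = cmd[1] & ~0xfffULL;
	inv->size  = 1ULL << 12;	/* one granule, ignoring range ext. */
}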

>
> The purpose of these interfaces is to support high-performance, fully
> functional vIOMMUs of the real HW; we should not lose sight of
> that goal.
>
> We are actually planning to go further and expose direct invalidation
> work queues complete with HW doorbells to userspace. This obviously
> must be in native HW format.

Doesn't seem relevant, since direct access to the command queue wouldn't
use this struct.

>
> Nicolin, I think we should tweak the uAPI here so that the
> invalidation opaque data has a format tagged on its own, instead of
> re-using the HWPT tag. I.e. you can have an ARM SMMUv3 invalidate type
> tag and also a virtio-iommu invalidate type tag.
>
> This will allow Jean to put the invalidation decoding in the iommu
> drivers if we think that is the right direction. virtio can
> standardize it as a "HW format".
>
> > Ideally I'd like something like this for vhost-iommu:
> >
> > * slow path through userspace for complex requests like attach-table and
> > probe, where the VMM can decode arch-specific information and translate
> > them to iommufd and vhost-iommu ioctls to update the configuration.
> >
> > * fast path within the kernel for performance-critical requests like
> > invalidate, page request and response. It would be absurd for the
> > vhost-iommu driver to translate generic invalidation requests from the
> > guest into arch-specific commands with special opcodes, when the next
> > step is calling the IOMMU driver which does that for free.
>
> Someone has to do the conversion. If you don't think virtio should do
> it then I'd be OK to add a type tag for virtio format invalidation and
> put it in the IOMMU driver.

Implementing two invalidation formats in each IOMMU driver does not seem
practical.
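
Each driver would grow a dispatcher along these lines (a sketch
assuming the type tag proposed above; tags and helpers are invented):

#include <linux/errno.h>
#include <linux/iommu.h>
#include <linux/types.h>

enum {					/* invented format tags */
	IOMMU_INV_FORMAT_SMMUV3,
	IOMMU_INV_FORMAT_VIRTIO,
};

/* Invented helpers: pass a native command through, or re-encode a
 * generic request into a native command. */
int smmu_handle_native_cmd(struct iommu_domain *domain, const void *cmd);
int smmu_handle_generic_req(struct iommu_domain *domain, const void *req);

static int smmu_cache_invalidate(struct iommu_domain *domain,
				 u32 format, const void *req)
{
	switch (format) {
	case IOMMU_INV_FORMAT_SMMUV3:
		return smmu_handle_native_cmd(domain, req);
	case IOMMU_INV_FORMAT_VIRTIO:
		return smmu_handle_generic_req(domain, req);
	default:
		return -EOPNOTSUPP;
	}
}

The second path is close to what drivers already do for their own
invalidations; the native-format path is the added maintenance burden.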

>
> But given virtio overall already has to know *a lot* about how the HW
> it is wrapping works, I don't think it is necessarily absurd for virtio
> to do the conversion. I'd like to evaluate this in patches, in context
> with how much other unique HW code ends up in kernel-side vhost-iommu.

Ideally none. I'd rather leave those, attach and probe, in userspace,
and if possible make them compatible with iommufd to avoid register
decoding.

>
> However, I don't know the rationale for virtio-iommu; it seems like a
> strange direction to me.

A couple of reasons are relevant here: non-QEMU VMMs don't want to
emulate all vendor IOMMUs, new architectures get a vIOMMU mostly for
free, and vhost provides a faster path. There is also the ability to
optimize paravirtual interfaces for things like combined invalidation
(IOTLB+ATC) or, later, nested page requests.

For a while the main vIOMMU use-case was assignment to guest userspace,
mainly DPDK, which works great with a generic and slow map/unmap
interface. Since vSVA is still a niche use-case, and nesting without page
faults requires pinning the whole guest memory, map/unmap still seems more
desirable to me. But there is some renewed interest in supporting page
tables with virtio-iommu for the reasons above.

> All the iommu drivers have native command
> queues. ARM and AMD are both supporting native command queues directly
> in the guest, complete with a direct guest MMIO doorbell ring.

The base Arm SMMUv3 architecture mandates a single global command queue
(SMMUv2 uses registers). An SMMUv3 implementation can optionally add
multiple command queues, though I don't know whether those can be safely
assigned to guests. For the many SMMUv3 implementations that have a
single queue, and for other architectures, we can do better than
hardware emulation.

>
> If someone wants to optimize this, I'd think the way to do it is to use
> virtio-like techniques to put SW command queue processing in the
> kernel iommu driver and continue to use the HW native interface in the
> VM.

I didn't get which kernel this refers to.

>
> What benefit comes from replacing the HW native interface with virtio?
> Especially on ARM, where the native interface is pretty clean?
>
> > During previous discussions we came up with generic invalidations that
> > could fit both Arm and x86 [1][2]. The only difference was the ASID
> > (called archid/id in those proposals) which VT-d didn't need. Could we try
> > to build on that?
>
> IMHO this was just unioning all the different invalidation types
> together. It makes sense for something like virtio but it is
> illogical/obfuscated as a user/kernel interface since it still
> requires a userspace HW driver to understand what subset of the
> invalidations are used on the actual HW.

As above, decoding arch-specific structures into generic ones is what
an emulated IOMMU does, and the format in which it forwards the result
to the kernel makes no performance difference. The host IOMMU driver
checks the guest request and copies it into the command queue. Whether
that request comes in a structure binary-compatible with Arm SMMUvX.Y
or in some generic structure does not make a difference.
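
To make that concrete, rebuilding the native command from the generic
request is the exact inverse of the VMM-side decoding sketched earlier
(same invented names; 0x12 is the CMD_TLBI_NH_VA opcode):

/* Re-encode the generic request into the two command words written to
 * the SMMUv3 command queue. */
static void encode_tlbi_nh_va(const struct iommu_cache_invalidate *inv,
			      uint64_t cmd[2])
{
	cmd[0] = 0x12 | ((uint64_t)inv->id << 48);	/* opcode + ASID */
	cmd[1] = (inv->addr & ~0xfffULL) |		/* VA[63:12] */
		 ((inv->flags & INV_FLAG_LEAF) ? 1 : 0);
}

A handful of cycles either way, regardless of which format crosses the
user/kernel boundary.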

Thanks,
Jean