Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
From: Jason Gunthorpe
Date: Tue Apr 27 2021 - 13:24:38 EST
On Tue, Apr 27, 2021 at 02:50:45PM +1000, David Gibson wrote:
> > > I say this because the SPAPR looks quite a lot like PASID when it has
> > > APIs for allocating multiple tables and other things. I would be
> > > interested to hear someone from IBM talk about what it is doing and
> > > how it doesn't fit into today's IOMMU API.
>
> Hm. I don't think it's really like PASID. Just like Type1, the TCE
> backend represents a single DMA address space which all devices in the
> container will see at all times. The difference is that there can be
> multiple (well, 2) "windows" of valid IOVAs within that address space.
> Each window can have a different TCE (page table) layout. For kernel
> drivers, a smallish translated window at IOVA 0 is used for 32-bit
> devices, and a large direct mapped (no page table) window is created
> at a high IOVA for better performance with 64-bit DMA capable devices.
>
> With the VFIO backend we create (but don't populate) a similar
> smallish 32-bit window, userspace can create its own secondary window
> if it likes, though obviously for userspace use there will always be a
> page table. Userspace can choose the total size (but not address),
> page size and to an extent the page table format of the created
> window. Note that the TCE page table format is *not* the same as the
> POWER CPU core's page table format. Userspace can also remove the
> default small window and create its own.
So what do you need from the generic API? I'd suggest that if userspace
passes in the required IOVA range, all the IOMMU drivers would benefit
by setting up properly sized page tables, and PPC could use that to
drive a single window. I notice this is all DPDK did to support TCE.
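To make the "pass in the required IOVA range" idea concrete, here is a rough sketch of how a driver might size a single window from such a hint. Everything in it is hypothetical: struct iova_hint, struct dma_window and window_for_hint() are names invented for illustration, not an existing kernel or VFIO interface, and the page-aligned base with a power-of-two size is an assumption rather than a statement about what the TCE hardware requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the IOVA range userspace declares it will use. */
struct iova_hint {
	uint64_t start;		/* first IOVA userspace intends to map */
	uint64_t last;		/* last IOVA (inclusive) */
	unsigned int page_shift;	/* requested IOMMU page size, log2 */
};

/* Hypothetical: the single window a driver would create from the hint. */
struct dma_window {
	uint64_t base;		/* page-aligned start of the window */
	uint64_t size;		/* power-of-two size in bytes (assumed) */
	uint64_t entries;	/* page table entries needed to cover it */
};

/* Round v up to a power of two; returns 0 if that does not fit in 64 bits. */
static uint64_t round_up_pow2(uint64_t v)
{
	uint64_t p = 1;

	if (v > (UINT64_C(1) << 63))
		return 0;
	while (p < v)
		p <<= 1;
	return p;
}

/* Derive one window covering [start, last]; returns 0 on success, -1 on error. */
int window_for_hint(const struct iova_hint *h, struct dma_window *w)
{
	uint64_t page_size, base, span, size;

	assert(h != NULL && w != NULL);

	if (h->page_shift < 12 || h->page_shift > 30 || h->last < h->start)
		return -1;

	page_size = UINT64_C(1) << h->page_shift;
	base = h->start & ~(page_size - 1);	/* align down to the page size */
	span = h->last - base;			/* length minus one */
	if (span == UINT64_MAX)
		return -1;			/* length would overflow */

	size = round_up_pow2(span + 1);
	if (size == 0)
		return -1;
	if (size < page_size)
		size = page_size;

	w->base = base;
	w->size = size;
	w->entries = size >> h->page_shift;
	return 0;
}
```

With a hint of [0, 1G) and 64K pages this yields one 1G window with 16384 entries, so the page table is sized once up front instead of being guessed.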
> The second wrinkle is pre-registration. That lets userspace register
> certain userspace VA ranges (*not* IOVA ranges) as being the only ones
> allowed to be mapped into the IOMMU. This is a performance
> optimization, because on pre-registration we also pre-account memory
> that will be effectively locked by DMA mappings, rather than doing it
> at DMA map and unmap time.
This feels like nesting IOASIDs to me, much like a vPASID.
The pre-registered VA range would be the root of the tree and the
vIOMMU created ones would be children of the tree. This could allow
the map operations of the child to refer to already prepped physical
memory held in the root IOASID avoiding the GUP/etc cost.
Seems fairly generic, though I'm not sure about the KVM linkage.
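As a rough illustration of the root/child arrangement, something like the sketch below is what I have in mind. It is a userspace toy model, not proposed uAPI: struct root_ioasid, struct child_ioasid, root_preregister() and child_map() are all made-up names, the fixed-size range array is only there to keep it short, and the actual pinning and page table programming are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RANGES 8	/* arbitrary limit for the sketch */

struct va_range {
	uint64_t va;
	uint64_t len;
};

/* Hypothetical root IOASID: owns the pre-registered, already accounted memory. */
struct root_ioasid {
	struct va_range ranges[MAX_RANGES];
	size_t nr_ranges;
	uint64_t accounted_bytes;	/* charged once, at registration time */
};

/* Hypothetical child IOASID, e.g. one created on behalf of a vIOMMU. */
struct child_ioasid {
	const struct root_ioasid *parent;
	size_t nr_mappings;
};

/* Register a VA range in the root; this is where pinning/accounting would happen. */
int root_preregister(struct root_ioasid *root, uint64_t va, uint64_t len)
{
	assert(root != NULL);

	if (len == 0 || va + len < va || root->nr_ranges == MAX_RANGES)
		return -1;

	root->ranges[root->nr_ranges++] = (struct va_range){ va, len };
	root->accounted_bytes += len;
	return 0;
}

/*
 * Map [va, va + len) at iova in the child. The request only succeeds if it
 * lies entirely inside one range already held by the root, so no new pinning
 * or accounting is needed on this path.
 */
int child_map(struct child_ioasid *child, uint64_t iova, uint64_t va,
	      uint64_t len)
{
	size_t i;

	assert(child != NULL && child->parent != NULL);
	(void)iova;	/* a real implementation would install the translation */

	if (len == 0 || va + len < va)
		return -1;

	for (i = 0; i < child->parent->nr_ranges; i++) {
		const struct va_range *r = &child->parent->ranges[i];

		if (va >= r->va && len <= r->len && va - r->va <= r->len - len) {
			child->nr_mappings++;
			return 0;
		}
	}
	return -1;	/* not pre-registered in the root */
}
```

The point of the split is that the expensive work is done once in root_preregister(), while child_map() reduces to a range lookup against memory the root already holds.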
> I like the idea of a common DMA/IOMMU handling system across
> platforms. However in order to be efficiently usable for POWER it
> will need to include multiple windows, allowing the user to change
> those windows and something like pre-registration to amortize
> accounting costs for heavy vIOMMU load.
I have a feeling /dev/ioasid is going to end up with some HW specific
escape hatch to create some HW specific IOASID types and operate on
them in a HW specific way.
However, what I would like to see is that something simple like DPDK
can have a single implementation - POWER should implement the standard
operations and map them to something that will work for it.
As an ideal, only things like the HW specific qemu vIOMMU driver
should be reaching for all the special stuff.
In this way the kernel IOMMU driver and the qemu user vIOMMU driver
would form something of a classical split user/kernel driver pattern.
Jason