Re: [RFC] /dev/ioasid uAPI proposal

From: David Gibson
Date: Thu Jun 03 2021 - 02:28:43 EST


On Wed, Jun 02, 2021 at 02:19:30PM -0300, Jason Gunthorpe wrote:
> On Wed, Jun 02, 2021 at 04:15:07PM +1000, David Gibson wrote:
>
> > Is there a compelling reason to have all the IOASIDs handled by one
> > FD?
>
> There was an answer on this: if every PASID needs an IOASID then there
> are too many FDs.

Too many in what regard? fd limits? Something else?

It seems to me there are two different cases for PASID handling here.
One is where userspace explicitly creates each valid PASID and
attaches a separate pagetable for each (or handles each with
MAP/UNMAP). In that case I wouldn't have expected there to be too
many fds.

Then there's the case where we register a whole PASID table, in which
case I think you only need the one FD. We can treat that as creating
an 84-bit IOAS, whose pagetable format is (PASID table + a bunch of
pagetables for each PASID).
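
To make the 84-bit idea concrete, here is a rough sketch, assuming the
usual 20-bit PASID on top of a 64-bit IOVA. All of the names below
(ioas84_addr, toy_pgtable, toy_pasid_table, ioas84_translate) are made
up for illustration and are not part of the proposal, and the single
linear range is only a stand-in for a real per-PASID pagetable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PASID_BITS 20u				/* PCIe PASIDs are at most 20 bits */
#define PASID_MAX  (UINT32_C(1) << PASID_BITS)

/* Hypothetical address in the combined 20 + 64 = 84-bit space. */
struct ioas84_addr {
	uint32_t pasid;		/* upper 20 bits: selects a per-PASID pagetable */
	uint64_t iova;		/* lower 64 bits: address within that PASID */
};

/* Toy stand-in for a per-PASID pagetable: one linear range. */
struct toy_pgtable {
	uint64_t iova_base;
	uint64_t size;
	uint64_t out_base;
};

/* Toy PASID table: entry[pasid] is NULL when that PASID is not valid. */
struct toy_pasid_table {
	const struct toy_pgtable **entry;
	uint32_t nr_entries;
};

/* Walk the "PASID table + per-PASID pagetable" format as one lookup. */
static bool ioas84_translate(const struct toy_pasid_table *t,
			     struct ioas84_addr a, uint64_t *out)
{
	const struct toy_pgtable *pt;

	if (a.pasid >= PASID_MAX || a.pasid >= t->nr_entries)
		return false;

	pt = t->entry[a.pasid];
	if (!pt || a.iova < pt->iova_base || a.iova - pt->iova_base >= pt->size)
		return false;

	*out = pt->out_base + (a.iova - pt->iova_base);
	return true;
}
```

The point is only that the PASID acts as extra high-order address
bits, so the whole thing can sit behind a single IOAS and a single fd.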

> It is difficult to share the get_user_pages cache across FDs.

Ah... hrm, yes I can see that.

> There are global properties in the /dev/iommu FD, like what devices
> are part of it, that are important for group security operations. This
> becomes confused if it is split to many FDs.

I'm still not seeing those. I'm really not seeing any well-defined
meaning to devices being attached to the fd, but not to a particular
IOAS.

> > > I/O address space can be managed through two protocols, according to
> > > whether the corresponding I/O page table is constructed by the kernel or
> > > the user. When kernel-managed, a dma mapping protocol (similar to
> > > existing VFIO iommu type1) is provided for the user to explicitly specify
> > > how the I/O address space is mapped. Otherwise, a different protocol is
> > > provided for the user to bind a user-managed I/O page table to the
> > > IOMMU, plus necessary commands for iotlb invalidation and I/O fault
> > > handling.
> > >
> > > Pgtable binding protocol can be used only on the child IOASID's, implying
> > > IOASID nesting must be enabled. This is because the kernel doesn't trust
> > > userspace. Nesting allows the kernel to enforce its DMA isolation policy
> > > through the parent IOASID.
> >
> > To clarify, I'm guessing that's a restriction of likely practice,
> > rather than a fundamental API restriction. I can see a couple of
> > theoretical future cases where a user-managed pagetable for a "base"
> > IOASID would be feasible:
> >
> > 1) On some fancy future MMU allowing free nesting, where the kernel
> > would insert an implicit extra layer translating user addresses
> > to physical addresses, and the userspace manages a pagetable with
> > its own VAs being the target AS
>
> I would model this by having a "SVA" parent IOASID. A "SVA" IOASID is one
> where the IOVA == process VA and the kernel maintains this mapping.

That makes sense. Needs a different name to avoid Intel and PCI
specificness, but having a trivial "pagetable format" which just says
IOVA == user address is a nice idea.

> Since the uAPI is so general I do have a general expectation that the
> drivers/iommu implementations might need to be a bit more complicated,
> like if the HW can optimize certain specific graphs of IOASIDs we
> would still model them as graphs and the HW driver would have to
> "compile" the graph into the optimal hardware.
>
> This approach has worked reasonably well in other kernel areas.

That seems sensible.

> > 2) For a purely software virtual device, where its virtual DMA
> > engine can interpret user addresses fine
>
> This also sounds like an SVA IOASID.

Ok.

> Depending on HW if a device can really only bind to a very narrow kind
> of IOASID then it should ask for that (probably platform specific!)
> type during its attachment request to drivers/iommu.
>
> eg "I am special hardware and only know how to do PLATFORM_BLAH
> transactions, give me an IOASID compatible with that". If the only way
> to create "PLATFORM_BLAH" is with a SVA IOASID because BLAH is
> hardwired to the CPU ASID then that is just how it is.

Fair enough.

> > I wonder if there's a way to model this using a nested AS rather than
> > requiring special operations. e.g.
> >
> > 	'prereg' IOAS
> > 	|
> > 	\- 'rid' IOAS
> > 	   |
> > 	   \- 'pasid' IOAS (maybe)
> >
> > 'prereg' would have a kernel managed pagetable into which (for
> > example) qemu platform code would map all guest memory (using
> > IOASID_MAP_DMA). qemu's vIOMMU driver would then mirror the guest's
> > IO mappings into the 'rid' IOAS in terms of GPA.
> >
> > This wouldn't quite work as is, because the 'prereg' IOAS would have
> > no devices. But we could potentially have another call to mark an
> > IOAS as a purely "preregistration" or pure virtual IOAS. Using that
> > would be an alternative to attaching devices.
>
> It is one option for sure, this is where I was thinking when we were
> talking in the other thread. I think the decision is best
> implementation driven as the data structure to store the
> preregistration data should be rather purpose built.

Right. I think this gets nicer now that we're considering more
specific options at IOAS creation time, and different "types" of
IOAS. We could add a "preregistration" IOAS type, which supports
MAP/UNMAP of user addresses, and allows *no* devices to be attached,
but does allow other IOAS types to be added as children.
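
As a rough sketch of what I mean by "types", each IOAS type could
simply carry a table of which operations it permits. Everything named
here (the enums, ioas_allowed, ioas_check_op) is hypothetical and not
from the proposal. The 'prereg' row follows the description above, and
MAP/UNMAP failing on a user-managed pagetable follows the discussion
further down; the remaining entries are my assumptions.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical IOAS types; none of these names come from the proposal. */
enum ioas_type {
	IOAS_TYPE_KERNEL,	/* kernel-managed pagetable */
	IOAS_TYPE_USER_PGTABLE,	/* user-managed pagetable bound to the IOMMU */
	IOAS_TYPE_PREREG,	/* preregistration only, never has devices */
	IOAS_TYPE_NR
};

/* Hypothetical operation classes checked against the type. */
enum ioas_op {
	IOAS_OP_MAP_UNMAP,
	IOAS_OP_ATTACH_DEVICE,
	IOAS_OP_ADD_CHILD,
	IOAS_OP_NR
};

/* Unlisted entries default to false, i.e. the operation is refused. */
static const bool ioas_allowed[IOAS_TYPE_NR][IOAS_OP_NR] = {
	[IOAS_TYPE_KERNEL] = {
		[IOAS_OP_MAP_UNMAP]	= true,
		[IOAS_OP_ATTACH_DEVICE]	= true,
		[IOAS_OP_ADD_CHILD]	= true,
	},
	[IOAS_TYPE_USER_PGTABLE] = {
		/* MAP/UNMAP refused; no children is an assumption */
		[IOAS_OP_ATTACH_DEVICE]	= true,
	},
	[IOAS_TYPE_PREREG] = {
		[IOAS_OP_MAP_UNMAP]	= true,
		[IOAS_OP_ADD_CHILD]	= true,
	},
};

/* Return 0 if the operation is permitted on this type, -EINVAL if not. */
static int ioas_check_op(enum ioas_type type, enum ioas_op op)
{
	if ((unsigned int)type >= IOAS_TYPE_NR ||
	    (unsigned int)op >= IOAS_OP_NR)
		return -EINVAL;

	return ioas_allowed[type][op] ? 0 : -EINVAL;
}
```

With something like this, attaching a device to a 'prereg' IOAS fails
up front rather than needing a special case in each driver.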

>
> > > /*
> > > * Map/unmap process virtual addresses to I/O virtual addresses.
> > > *
> > > * Provide VFIO type1 equivalent semantics. Start with the same
> > > * restriction e.g. the unmap size should match those used in the
> > > * original mapping call.
> > > *
> > > * If IOASID_REGISTER_MEMORY has been called, the mapped vaddr
> > > * must be already in the preregistered list.
> > > *
> > > * Input parameters:
> > > * - u32 ioasid;
> > > * - refer to vfio_iommu_type1_dma_{un}map
> > > *
> > > * Return: 0 on success, -errno on failure.
> > > */
> > > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
> >
> > I'm assuming these would be expected to fail if a user managed
> > pagetable has been bound?
>
> Me too, or a SVA page table.
>
> This document would do well to have a list of imagined page table
> types and the set of operations that act on them. I think they are all
> pretty disjoint..

Right. With the possible exception that I can imagine a call for
several types which all support MAP/UNMAP, but have other different
characteristics.

> Your presentation of 'kernel owns the table' vs 'userspace owns the
> table' is a useful clarification to call out too
>
> > > 5. Use Cases and Flows
> > >
> > > Here assume VFIO will support a new model where every bound device
> > > is explicitly listed under /dev/vfio thus a device fd can be acquired w/o
> > > going through legacy container/group interface. For illustration purpose
> > > those devices are just called dev[1...N]:
> > >
> > > device_fd[1...N] = open("/dev/vfio/devices/dev[1...N]", mode);
> >
> > Minor detail, but I'd suggest /dev/vfio/pci/DDDD:BB:SS.F for the
> > filenames for actual PCI functions. Maybe /dev/vfio/mdev/something
> > for mdevs. That leaves other subdirs of /dev/vfio free for future
> > non-PCI device types, and /dev/vfio itself for the legacy group
> > devices.
>
> There are a bunch of nice options here if we go this path
>
> > > 5.2. Multiple IOASIDs (no nesting)
> > > ++++++++++++++++++++++++++++
> > >
> > > Dev1 and dev2 are assigned to the guest. vIOMMU is enabled. Initially
> > > both devices are attached to gpa_ioasid.
> >
> > Doesn't really affect your example, but note that the PAPR IOMMU does
> > not have a passthrough mode, so devices will not initially be attached
> > to gpa_ioasid - they will be unusable for DMA until attached to a
> > gIOVA ioasid.
>
> I think attachment should always be explicit in the API. If the user
> doesn't explicitly ask for a device to be attached to the IOASID then
> the iommu driver is free to block it.
>
> If you want passthrough then you have to create a passthrough IOASID
> and attach every device to it. Some of those attaches might be NOP's
> due to groups.

Agreed.

--
David Gibson                    | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au  | minimalist, thank you.  NOT _the_ _other_
                                | _way_ _around_!
http://www.ozlabs.org/~dgibson
