RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

From: Tian, Kevin
Date: Fri Apr 23 2021 - 05:06:58 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, April 22, 2021 8:10 PM
>
> On Thu, Apr 22, 2021 at 08:34:32AM +0000, Tian, Kevin wrote:
>
> > Another tricky thing is that a container may be linked to multiple iommu
> > domains in VFIO, as devices in the container may sit behind different
> > IOMMUs with inconsistent capabilities (commit 1ef3e2bc).
>
> Frankly this sounds over complicated. I would think /dev/ioasid should
> select the IOMMU when the first device is joined, and all future joins
> must be compatible with the original IOMMU - ie there is only one set
> of IOMMU capabilities in a /dev/ioasid.

Or could we still have just one /dev/ioasid but allow userspace to create
multiple gpa_ioasid_id's, each associated with a different iommu domain?
Then the compatibility check would be done at ATTACH_IOASID instead of
JOIN_IOASID_FD.

This does impose a burden on userspace, though: it has to understand the
IOMMU compatibilities, figure out which incompatible features affect
page table management (knowledge that is IOMMU vendor specific), and
then explicitly manage multiple /dev/ioasid's or multiple
gpa_ioasid_id's.

Alternatively, would it be a better design to have the kernel return an
error at attach/join time when an incompatibility is detected, so that
userspace simply opens a new /dev/ioasid or creates a new gpa_ioasid_id
for the failing device upon such failure, without constructing its own
compatibility knowledge?
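
Something like below (illustrative only; CREATE_IOASID is a placeholder
name here and the exact errno is unspecified):

gpa_ioasid = ioctl(ioasid_fd, CREATE_IOASID, ...);
if (ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid) < 0) {
// kernel detected an incompatible iommu; userspace doesn't need
// to know why - just give this device its own gpa_ioasid
gpa_ioasid2 = ioctl(ioasid_fd, CREATE_IOASID, ...);
ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid2);
}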

>
> This means qemu might have multiple /dev/ioasid's if the system has
> multiple incompatible IOMMUs (is this actually a thing?) The platform

One example is an Intel platform with integrated graphics (igd).
Typically there is one IOMMU dedicated to the igd and another IOMMU
serving all the remaining devices. The igd IOMMU may not support
IOMMU_CACHE while the other one does.
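
For reference, this is roughly how the vfio scheme from commit
1ef3e2bc discovers that mismatch today (abbreviated from
vfio_iommu_type1_attach_group(); error handling omitted):

// probe the capability and record it in the domain's mapping flags
if (iommu_capable(bus, IOMMU_CAP_CACHE_COHERENCY))
domain->prot |= IOMMU_CACHE;

// reuse an existing domain only when ops and flags match; otherwise
// the group stays on its own domain, so one container may end up
// carrying several iommu domains with different capabilities
list_for_each_entry(d, &iommu->domain_list, next) {
if (d->domain->ops == domain->domain->ops &&
d->prot == domain->prot) {
/* attach the group to this existing compatible domain */
break;
}
}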

> should design its IOMMU domains to minimize the number of
> /dev/ioasid's required.
>
> Is there a reason we need to share IOASIDs between completely
> divergent IOMMU implementations? I don't expect the HW should be able
> to physically share page tables??

Yes. E.g. in vSVA both devices (behind divergent IOMMUs) are bound
to a single guest process, which has a unique PASID and 1st-level page
table. The earlier incompatibility example applies only to the 2nd level.
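
To illustrate in this thread's notation (a sketch only): the same guest
1st-level table and PASID span two devices, while each device keeps its
own gpa_ioasid/2nd-level:

// one nested ioasid per 2nd-level, both pointing at the same guest
// 1st-level table and attached with the same PASID
nested_A = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_A);
nested_B = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_B);
ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_A, guest_cr3);
ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_B, guest_cr3);
ioctl(vfio_device_A, ATTACH_IOASID_PASID, pasid, nested_A);
ioctl(vfio_device_B, ATTACH_IOASID_PASID, pasid, nested_B);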

>
> That decision point alone might be the thing that just says we can't
> ever have /dev/vfio/vfio == /dev/ioasid

Yes, unless we adopt the vfio scheme, i.e. implicitly manage
incompatible iommu domains within /dev/ioasid.

>
> > Just to confirm. Above flow is for current map/unmap flavor as what
> > VFIO/vDPA do today. Later when nested translation is supported,
> > there is no need to detach gpa_ioasid_fd. Instead, a new cmd will
> > be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd:
>
> Sure.. The tricky bit will be to define both of the common nested
> operating modes.
>
> nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid_id);
> ioctl(ioasid_fd, SET_NESTED_IOASID_PAGE_TABLES, nested_ioasid, ..)
>
> // IOMMU will match on the device RID, no PASID:
> ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);
>
> // IOMMU will match on the device RID and PASID:
> ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);

I'm a bit confused about why we have both pasid and ioasid notations
together here. Why not use nested_ioasid as the pasid directly (i.e.
every pasid in nested mode is created by CREATE_NESTED_IOASID)?

Below I list the different scenarios for ATTACH_IOASID as I see them.
Here vfio_device could be a real PCI function (RID) or a subfunction
device (RID+def_ioasid). The vfio_device could be attached to a
gpa_ioasid (RID in guest view, no nesting), a nested_ioasid (RID in
guest view, nesting), or a nested_ioasid (RID+PASID in guest view,
nesting).

// IOMMU will match on the device RID, no nesting, no PASID:
ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid);

// IOMMU will match on the device (RID+def_ioasid), no nesting, no PASID:
ioctl(vfio_subdevice, ATTACH_IOASID, gpa_ioasid);

// IOMMU will match on the device RID, nesting, no PASID:
ioctl(vfio_device, ATTACH_IOASID, nested_ioasid);

// IOMMU will match on the device (RID+def_ioasid), nesting, no PASID:
ioctl(vfio_subdevice, ATTACH_IOASID, nested_ioasid);

// IOMMU will match on the device (RID+nested_ioasid), nesting, PASID:
ioctl(vfio_device, ATTACH_IOASID_PASID, nested_ioasid);

// IOMMU will match on the device (RID+nested_ioasid), nesting, PASID:
ioctl(vfio_subdevice, ATTACH_IOASID_PASID, nested_ioasid);
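
i.e. (sketch) the id returned by CREATE_NESTED_IOASID would itself be
the PASID value programmed into the device, so ATTACH_IOASID_PASID
needs no separate pasid argument:

// the allocator would guarantee nested_ioasid fits the PASID width,
// so it can be used as the PASID directly
nested_ioasid = ioctl(ioasid_fd, CREATE_NESTED_IOASID, gpa_ioasid);
ioctl(vfio_device, ATTACH_IOASID_PASID, nested_ioasid);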

>
> Notice that ATTACH (or bind, whatever) is always done on the
> vfio_device FD. ATTACH tells the IOMMU HW to link the PCI BDF&PASID to
> a specific page table defined by an IOASID.
>
> I expect we have many flavours of IOASID tables, eg we have normal,
> and 'nested with table controlled by hypervisor'. ARM has 'nested with
> table controlled by guest' right? So like this?
>
> nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID,
> gpa_ioasid_id, <some kind of viommu_id>)
> // PASID now goes to <viommu_id>
> ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);
>
> Where <viommu_id> is some internal to the guest handle of the viommu
> page table scoped within gpa_ioasid_id? Like maybe it is GPA of the
> base of the page table?
>
> The guest can't select its own PASIDs without telling the hypervisor,
> right?
>

If the whole PASID table is delegated to the guest, as in the ARM case,
the guest can select its own PASIDs w/o telling the hypervisor. I
haven't thought carefully about a clean way to support this scheme,
e.g. whether mandating that the guest always allocate PASIDs through
the hypervisor, even in this case, would make the uAPI simpler...
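
If we did go the mandatory-allocation route, the flow might look like
below (purely illustrative; ALLOC_PASID is a made-up cmd, and the
guest-side request would arrive via some PV/vIOMMU channel):

// hypervisor allocates on behalf of the guest's PASID alloc request
pasid = ioctl(ioasid_fd, ALLOC_PASID, ...);
// the delegated table is still walked by HW, but the guest only
// installs entries for PASIDs obtained through the hypervisor
nested_ioasid = ioctl(ioasid_fd, CREATE_DELEGATED_IOASID,
gpa_ioasid_id, viommu_id);
ioctl(vfio_device, ATTACH_IOASID_PASID, pasid, nested_ioasid);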

Thanks
Kevin