RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Thu Jun 03 2021 - 04:12:44 EST


> From: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> Sent: Wednesday, June 2, 2021 2:15 PM
>
[...]

> >
> > /*
> > * Get information about an I/O address space
> > *
> > * Supported capabilities:
> > * - VFIO type1 map/unmap;
> > * - pgtable/pasid_table binding
> > * - hardware nesting vs. software nesting;
> > * - ...
> > *
> > * Related attributes:
> > * - supported page sizes, reserved IOVA ranges (DMA mapping);
>
> Can I request we represent this in terms of permitted IOVA ranges,
> rather than reserved IOVA ranges. This works better with the "window"
> model I have in mind for unifying the restrictions of the POWER IOMMU
> with Type1 like mapping.

Can you elaborate on how permitted ranges work better here?

> > #define IOASID_MAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 6)
> > #define IOASID_UNMAP_DMA _IO(IOASID_TYPE, IOASID_BASE + 7)
>
> I'm assuming these would be expected to fail if a user managed
> pagetable has been bound?

Yes. Following Jason's suggestion, the format will be specified when
creating an IOASID, so incompatible commands will simply be rejected.

> > #define IOASID_BIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 9)
> > #define IOASID_UNBIND_PGTABLE _IO(IOASID_TYPE, IOASID_BASE + 10)
>
> I'm assuming that UNBIND would return the IOASID to a kernel-managed
> pagetable?

There will be no UNBIND call in the next version; unbinding will be
handled automatically when the IOASID is destroyed.

>
> For debugging and certain hypervisor edge cases it might be useful to
> have a call to allow userspace to look up a specific IOVA in a guest
> managed pgtable.

Since all the mapping metadata comes from userspace, why would one
rely on the kernel to provide such a service? Or are you simply asking
for some debugfs node to dump the I/O page table for a given
IOASID?

>
>
> > /*
> > * Bind a user-managed PASID table to the IOMMU
> > *
> > * This is required for platforms which place PASID table in the GPA space.
> > * In this case the specified IOASID represents the per-RID PASID space.
> > *
> > * Alternatively this may be replaced by IOASID_BIND_PGTABLE plus a
> > * special flag to indicate the difference from normal I/O address spaces.
> > *
> > * The format info of the PASID table is reported in IOASID_GET_INFO.
> > *
> > * As explained in the design section, user-managed I/O page tables must
> > * be explicitly bound to the kernel even on these platforms. It allows
> > * the kernel to uniformly manage I/O address spaces across all platforms.
> > * Otherwise, the iotlb invalidation and page faulting uAPI must be hacked
> > * to carry device routing information to indirectly mark the hidden I/O
> > * address spaces.
> > *
> > * Input parameters:
> > * - child_ioasid;
>
> Wouldn't this be the parent ioasid, rather than one of the potentially
> many child ioasids?

There is just one child IOASID (per device) for this PASID table.

The parent IOASID in this case carries the GPA mapping.

> >
> > /*
> > * Invalidate IOTLB for a user-managed I/O page table
> > *
> > * Unlike what's defined in include/uapi/linux/iommu.h, this command
> > * doesn't allow the user to specify cache type and likely supports only
> > * two granularities (all, or a specified range) in the I/O address space.
> > *
> > * Physical IOMMUs have three cache types (iotlb, dev_iotlb and pasid
> > * cache). If the IOASID represents an I/O address space, the invalidation
> > * always applies to the iotlb (and dev_iotlb if enabled). If the IOASID
> > * represents a vPASID space, then this command applies to the PASID
> > * cache.
> > *
> > * Similarly this command doesn't provide IOMMU-like granularity
> > * info (domain-wide, pasid-wide, range-based), since it's all about the
> > * I/O address space itself. The ioasid driver walks the attached
> > * routing information to match the IOMMU semantics under the
> > * hood.
> > *
> > * Input parameters:
> > * - child_ioasid;
>
> And couldn't this be any ioasid, not just a child one, depending on
> whether you want PASID scope or RID scope invalidation?

Yes, any IOASID can accept the invalidation command. This was based
on the old assumption that bind+invalidate applies only to a child,
which will be fixed in the next version.

> > /*
> > * Attach a vfio device to the specified IOASID
> > *
> > * Multiple vfio devices can be attached to the same IOASID, and vice
> > * versa.
> > *
> > * User may optionally provide a "virtual PASID" to mark an I/O page
> > * table on this vfio device. Whether the virtual PASID is physically used
> > * or converted to another kernel-allocated PASID is a policy in vfio device
> > * driver.
> > *
> > * There is no need to specify ioasid_fd in this call due to the assumption
> > * of 1:1 connection between vfio device and the bound fd.
> > *
> > * Input parameter:
> > * - ioasid;
> > * - flag;
> > * - user_pasid (if specified);
>
> Wouldn't the PASID be communicated by whether you give a parent or
> child ioasid, rather than needing an extra value?

No. The IOASID is just a software handle.

> > struct ioasid_data {
> > // link to ioasid_ctx->ioasid_list
> > struct list_head next;
> >
> > // the IOASID number
> > u32 ioasid;
> >
> > // the handle to convey iommu operations
> > // hold the pgd (TBD until discussing iommu api)
> > struct iommu_domain *domain;
> >
> > // map metadata (vfio type1 semantics)
> > struct rb_node dma_list;
>
> Why do you need this? Can't you just store the kernel managed
> mappings in the host IO pgtable?

A simple reason is that to implement vfio type1 semantics we need
to make sure an unmap uses the same size as the corresponding map.
The metadata allows verifying this assumption. Another reason is
that when doing software nesting, the page table linked into the
iommu domain is the shadow one. It's better to keep the original
metadata so it can be used to update the shadow when another
level (parent or child) changes its mapping.

> >
> > 5.3. IOASID nesting (software)
> > +++++++++++++++++++++++++
> >
> > Same usage scenario as 5.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of the user that
> > creates the shadow mapping.
> > shadow mapping.
>
> In this case, I feel like the preregistration is redundant with the
> GPA level mapping. As long as the gIOVA mappings (which might be
> frequent) can piggyback on the accounting done for the GPA mapping we
> accomplish what we need from preregistration.

Yes, preregistration makes more sense when multiple IOASIDs are
used but not nested together.

> > 5.5. Guest SVA (vSVA)
> > ++++++++++++++++++
> >
> > After boot the guest further creates a GVA address space (gpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 4, users should avoid exposing ENQCMD on both
> > pdev and mdev.
> >
> > The sequence applies to all device types (being pdev or mdev), except
> > one additional step to call KVM for ENQCMD-capable mdev:
> >
> > /* After boots */
> > /* Make GVA space nested on GPA space */
> > gva_ioasid = ioctl(ioasid_fd, IOASID_CREATE_NESTING,
> > gpa_ioasid);
>
> I'm not clear what gva_ioasid is representing. Is it representing a
> single vPASID's address space, or a whole bunch of vPASIDs address
> spaces?

A single vPASID's address space.

Thanks
Kevin