Re: [RFC v2] /dev/iommu uAPI proposal

From: Alex Williamson
Date: Fri Jul 09 2021 - 17:51:09 EST


Hi Kevin,

A couple first pass comments...

On Fri, 9 Jul 2021 07:48:44 +0000
"Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> 2.2. /dev/vfio device uAPI
> ++++++++++++++++++++++++++
>
> /*
> * Bind a vfio_device to the specified IOMMU fd
> *
> * The user should provide a device cookie when calling this ioctl. The
> * cookie is later used in IOMMU fd for capability query, iotlb invalidation
> * and I/O fault handling.
> *
> * User is not allowed to access the device before the binding operation
> * is completed.
> *
> * Unbind is automatically conducted when device fd is closed.
> *
> * Input parameters:
> * - iommu_fd;
> * - cookie;
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)

I believe this is an ioctl on the device fd, therefore it should be
named VFIO_DEVICE_BIND_IOMMU_FD.

>
>
> /*
> * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
> *
> * Add a new device capability. The presence indicates that the user
> * is allowed to create multiple I/O address spaces on this device. The
> * capability further includes following flags:
> *
> * - PASID_DELEGATED, if clear every vPASID must be registered to
> * the kernel;
> * - PASID_CPU, if set vPASID is allowed to be carried in the CPU
> * instructions (e.g. ENQCMD);
> * - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
> *
> * The user must check that all devices with PASID_CPU set have the
> * same setting on PASID_CPU_VIRT. If mismatching, it should enable
> * vPASID only in one category (all set, or all clear).
> *
> * When the user enables vPASID on the device with PASID_CPU_VIRT
> * set, it must enable vPASID CPU translation via kvm fd before attempting
> * to use ENQCMD to submit work items. The command portal is blocked
> * by the kernel until the CPU translation is enabled.
> */
> #define VFIO_DEVICE_INFO_CAP_PASID 5
>
>
> /*
> * Attach a vfio device to the specified IOASID
> *
> * Multiple vfio devices can be attached to the same IOASID, and vice
> * versa.
> *
> * User may optionally provide a "virtual PASID" to mark an I/O page
> * table on this vfio device, if PASID_DELEGATED is not set in device info.
> * Whether the virtual PASID is physically used or converted to another
> * kernel-allocated PASID is a policy in the kernel.
> *
> * Because one device is allowed to bind to multiple IOMMU fd's, the
> * user should provide both iommu_fd and ioasid for this attach operation.
> *
> * Input parameter:
> * - iommu_fd;
> * - ioasid;
> * - flag;
> * - vpasid (if specified);
> *
> * Return: 0 on success, -errno on failure.
> */
> #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 23)
> #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE + 24)

Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID

...
> 3. Sample structures and helper functions
> --------------------------------------------------------
>
> Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
>
> struct iommu_ctx *iommu_ctx_fdget(int fd);
> struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> struct device *device, u64 cookie);
> int iommu_unregister_device(struct iommu_dev *dev);
>
> An iommu_ctx is created for each fd:
>
> struct iommu_ctx {
> // a list of allocated IOASID data's
> struct xarray ioasid_xa;
>
> // a list of registered devices
> struct xarray dev_xa;
> };
>
> Later some group-tracking fields will be also introduced to support
> multi-devices group.
>
> Each registered device is represented by iommu_dev:
>
> struct iommu_dev {
> struct iommu_ctx *ctx;
> // always be the physical device
> struct device *device;
> u64 cookie;
> struct kref kref;
> };
>
> A successful binding establishes a security context for the bound
> device and returns struct iommu_dev pointer to the caller. After this
> point, the user is allowed to query device capabilities via IOMMU_
> DEVICE_GET_INFO.

If we have an initial singleton group only restriction, I assume that
both iommu_register_device() would fail for any devices that are not in
a singleton group and vfio would only expose direct device files for
the devices in singleton groups. The latter implementation could
change when multi-device group support is added so that userspace can
assume that if the vfio device file exists, this interface is available.
I think this is confirmed further below.

> For mdev the struct device should be the pointer to the parent device.

I don't get how iommu_register_device() differentiates an mdev from a
pdev in this case.

...
> 4.3. IOASID nesting (software)
> ++++++++++++++++++++++++++++++
>
> Same usage scenario as 4.2, with software-based IOASID nesting
> available. In this mode it is the kernel instead of user to create the
> shadow mapping.
>
> The flow before guest boots is same as 4.2, except one point. Because
> giova_ioasid is nested on gpa_ioasid, locked accounting is only
> conducted for gpa_ioasid which becomes the only root.
>
> There could be a case where different gpa_ioasids are created due
> to incompatible format between dev1/dev2 (e.g. about IOMMU
> enforce-snoop). In such case the user could further created a dummy
> IOASID (HVA->HVA) as the root parent for two gpa_ioasids to avoid
> duplicated accounting. But this scenario is not covered in following
> flows.

This use case has been noted several times in the proposal, it probably
deserves an example.

>
> To save space we only list the steps after boots (i.e. both dev1/dev2
> have been attached to gpa_ioasid before guest boots):
>
> /* After boots */
> /* Create GIOVA space nested on GPA space
> * Both page tables are managed by the kernel
> */
> alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);

So the user would use IOMMU_DEVICE_GET_INFO on the iommu_fd with device
cookie2 after the VFIO_DEVICE_BIND_IOMMU_FD to learn that software
nesting is supported before proceeding down this path?

>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Setup a GIOVA [0x2000] ->GPA [0x1000] mapping for giova_ioasid,
> * based on the vIOMMU page table. The kernel is responsible for
> * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
> * by walking the parent's I/O page table to find out GPA [0x1000] ->
> * HVA [0x40001000].
> */
> dma_map = {
> .ioasid = giova_ioasid;
> .iova = 0x2000; // GIOVA
> .vaddr = 0x1000; // GPA
> .size = 4KB;
> };
> ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
>
> 4.4. IOASID nesting (hardware)
> ++++++++++++++++++++++++++++++
>
> Same usage scenario as 4.2, with hardware-based IOASID nesting
> available. In this mode the I/O page table is managed by userspace
> thus an invalidation interface is used for the user to request iotlb
> invalidation.
>
> /* After boots */
> /* Create GIOVA space nested on GPA space.
> * Claim it's an user-managed I/O page table.
> */
> alloc_data = {
> .user_pgtable = true;
> .parent = gpa_ioasid;
> .addr = giova_pgtable;
> // and format information;
> };
> giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> /* Attach dev2 to the new address space (child)
> * Note dev2 is still attached to gpa_ioasid (parent)
> */
> at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
>
> /* Invalidate IOTLB when required */
> inv_data = {
> .ioasid = giova_ioasid;
> // granular/cache type information
> };
> ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);
>
> /* See 4.6 for I/O page fault handling */
>
> 4.5. Guest SVA (vSVA)
> +++++++++++++++++++++
>
> After boots the guest further creates a GVA address spaces (vpasid1) on
> dev1. Dev2 is not affected (still attached to giova_ioasid).
>
> As explained in section 1.4, the user should check the PASID capability
> exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
> semantics when doing the attaching call:

And this characteristic lives in VFIO_DEVICE_GET_INFO rather than
IOMMU_DEVICE_GET_INFO because this is a characteristic known by the
vfio device driver rather than the system IOMMU, right? Thanks,

Alex