RE: [RFC v2] /dev/iommu uAPI proposal

From: Tian, Kevin
Date: Sun Jul 11 2021 - 21:28:39 EST


> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Saturday, July 10, 2021 5:51 AM
>
> Hi Kevin,
>
> A couple first pass comments...
>
> On Fri, 9 Jul 2021 07:48:44 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
> > 2.2. /dev/vfio device uAPI
> > ++++++++++++++++++++++++++
> >
> > /*
> > * Bind a vfio_device to the specified IOMMU fd
> > *
> > * The user should provide a device cookie when calling this ioctl. The
> > * cookie is later used in IOMMU fd for capability query, iotlb invalidation
> > * and I/O fault handling.
> > *
> > * User is not allowed to access the device before the binding operation
> > * is completed.
> > *
> > * Unbind is automatically conducted when device fd is closed.
> > *
> > * Input parameters:
> > * - iommu_fd;
> > * - cookie;
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_BIND_IOMMU_FD _IO(VFIO_TYPE, VFIO_BASE + 22)
>
> I believe this is an ioctl on the device fd, therefore it should be
> named VFIO_DEVICE_BIND_IOMMU_FD.

make sense.

>
> >
> >
> > /*
> > * Report vPASID info to userspace via VFIO_DEVICE_GET_INFO
> > *
> > * Add a new device capability. The presence indicates that the user
> > * is allowed to create multiple I/O address spaces on this device. The
> > * capability further includes following flags:
> > *
> > * - PASID_DELEGATED, if clear every vPASID must be registered to
> > * the kernel;
> > * - PASID_CPU, if set vPASID is allowed to be carried in the CPU
> > * instructions (e.g. ENQCMD);
> > * - PASID_CPU_VIRT, if set require vPASID translation in the CPU;
> > *
> > * The user must check that all devices with PASID_CPU set have the
> > * same setting on PASID_CPU_VIRT. If mismatching, it should enable
> > * vPASID only in one category (all set, or all clear).
> > *
> > * When the user enables vPASID on the device with PASID_CPU_VIRT
> > * set, it must enable vPASID CPU translation via kvm fd before attempting
> > * to use ENQCMD to submit work items. The command portal is blocked
> > * by the kernel until the CPU translation is enabled.
> > */
> > #define VFIO_DEVICE_INFO_CAP_PASID 5
> >
> >
> > /*
> > * Attach a vfio device to the specified IOASID
> > *
> > * Multiple vfio devices can be attached to the same IOASID, and vice
> > * versa.
> > *
> > * User may optionally provide a "virtual PASID" to mark an I/O page
> > * table on this vfio device, if PASID_DELEGATED is not set in device info.
> > * Whether the virtual PASID is physically used or converted to another
> > * kernel-allocated PASID is a policy in the kernel.
> > *
> > * Because one device is allowed to bind to multiple IOMMU fd's, the
> > * user should provide both iommu_fd and ioasid for this attach operation.
> > *
> > * Input parameter:
> > * - iommu_fd;
> > * - ioasid;
> > * - flag;
> > * - vpasid (if specified);
> > *
> > * Return: 0 on success, -errno on failure.
> > */
> > #define VFIO_ATTACH_IOASID _IO(VFIO_TYPE, VFIO_BASE +
> 23)
> > #define VFIO_DETACH_IOASID _IO(VFIO_TYPE, VFIO_BASE +
> 24)
>
> Likewise, VFIO_DEVICE_{ATTACH,DETACH}_IOASID
>
> ...
> > 3. Sample structures and helper functions
> > --------------------------------------------------------
> >
> > Three helper functions are provided to support VFIO_BIND_IOMMU_FD:
> >
> > struct iommu_ctx *iommu_ctx_fdget(int fd);
> > struct iommu_dev *iommu_register_device(struct iommu_ctx *ctx,
> > struct device *device, u64 cookie);
> > int iommu_unregister_device(struct iommu_dev *dev);
> >
> > An iommu_ctx is created for each fd:
> >
> > struct iommu_ctx {
> > // a list of allocated IOASID data's
> > struct xarray ioasid_xa;
> >
> > // a list of registered devices
> > struct xarray dev_xa;
> > };
> >
> > Later some group-tracking fields will be also introduced to support
> > multi-devices group.
> >
> > Each registered device is represented by iommu_dev:
> >
> > struct iommu_dev {
> > struct iommu_ctx *ctx;
> > // always be the physical device
> > struct device *device;
> > u64 cookie;
> > struct kref kref;
> > };
> >
> > A successful binding establishes a security context for the bound
> > device and returns struct iommu_dev pointer to the caller. After this
> > point, the user is allowed to query device capabilities via IOMMU_
> > DEVICE_GET_INFO.
>
> If we have an initial singleton group only restriction, I assume that
> both iommu_register_device() would fail for any devices that are not in
> a singleton group and vfio would only expose direct device files for
> the devices in singleton groups. The latter implementation could
> change when multi-device group support is added so that userspace can
> assume that if the vfio device file exists, this interface is available.
> I think this is confirmed further below.

Exactly. Will elaborate this assumption in next version.

>
> > For mdev the struct device should be the pointer to the parent device.
>
> I don't get how iommu_register_device() differentiates an mdev from a
> pdev in this case.

via device cookie.

>
> ...
> > 4.3. IOASID nesting (software)
> > ++++++++++++++++++++++++++++++
> >
> > Same usage scenario as 4.2, with software-based IOASID nesting
> > available. In this mode it is the kernel instead of user to create the
> > shadow mapping.
> >
> > The flow before guest boots is same as 4.2, except one point. Because
> > giova_ioasid is nested on gpa_ioasid, locked accounting is only
> > conducted for gpa_ioasid which becomes the only root.
> >
> > There could be a case where different gpa_ioasids are created due
> > to incompatible format between dev1/dev2 (e.g. about IOMMU
> > enforce-snoop). In such case the user could further created a dummy
> > IOASID (HVA->HVA) as the root parent for two gpa_ioasids to avoid
> > duplicated accounting. But this scenario is not covered in following
> > flows.
>
> This use case has been noted several times in the proposal, it probably
> deserves an example.

will do.

>
> >
> > To save space we only list the steps after boots (i.e. both dev1/dev2
> > have been attached to gpa_ioasid before guest boots):
> >
> > /* After boots */
> > /* Create GIOVA space nested on GPA space
> > * Both page tables are managed by the kernel
> > */
> > alloc_data = {.user_pgtable = false; .parent = gpa_ioasid};
> > giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
>
> So the user would use IOMMU_DEVICE_GET_INFO on the iommu_fd with
> device
> cookie2 after the VFIO_DEVICE_BIND_IOMMU_FD to learn that software
> nesting is supported before proceeding down this path?

yes. If this capability is not available, the user should fall back to the
flow in 4.2, i.e. generating shadow mappings in userspace.

>
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Setup a GIOVA [0x2000] ->GPA [0x1000] mapping for giova_ioasid,
> > * based on the vIOMMU page table. The kernel is responsible for
> > * creating the shadow mapping GIOVA [0x2000] -> HVA [0x40001000]
> > * by walking the parent's I/O page table to find out GPA [0x1000] ->
> > * HVA [0x40001000].
> > */
> > dma_map = {
> > .ioasid = giova_ioasid;
> > .iova = 0x2000; // GIOVA
> > .vaddr = 0x1000; // GPA
> > .size = 4KB;
> > };
> > ioctl(iommu_fd, IOMMU_MAP_DMA, &dma_map);
> >
> > 4.4. IOASID nesting (hardware)
> > ++++++++++++++++++++++++++++++
> >
> > Same usage scenario as 4.2, with hardware-based IOASID nesting
> > available. In this mode the I/O page table is managed by userspace
> > thus an invalidation interface is used for the user to request iotlb
> > invalidation.
> >
> > /* After boots */
> > /* Create GIOVA space nested on GPA space.
> > * Claim it's an user-managed I/O page table.
> > */
> > alloc_data = {
> > .user_pgtable = true;
> > .parent = gpa_ioasid;
> > .addr = giova_pgtable;
> > // and format information;
> > };
> > giova_ioasid = ioctl(iommu_fd, IOMMU_IOASID_ALLOC, &alloc_data);
> >
> > /* Attach dev2 to the new address space (child)
> > * Note dev2 is still attached to gpa_ioasid (parent)
> > */
> > at_data = { .fd = iommu_fd; .ioasid = giova_ioasid};
> > ioctl(device_fd2, VFIO_ATTACH_IOASID, &at_data);
> >
> > /* Invalidate IOTLB when required */
> > inv_data = {
> > .ioasid = giova_ioasid;
> > // granular/cache type information
> > };
> > ioctl(iommu_fd, IOMMU_INVALIDATE_CACHE, &inv_data);
> >
> > /* See 4.6 for I/O page fault handling */
> >
> > 4.5. Guest SVA (vSVA)
> > +++++++++++++++++++++
> >
> > After boots the guest further creates a GVA address spaces (vpasid1) on
> > dev1. Dev2 is not affected (still attached to giova_ioasid).
> >
> > As explained in section 1.4, the user should check the PASID capability
> > exposed via VFIO_DEVICE_GET_INFO and follow the required uAPI
> > semantics when doing the attaching call:
>
> And this characteristic lives in VFIO_DEVICE_GET_INFO rather than
> IOMMU_DEVICE_GET_INFO because this is a characteristic known by the
> vfio device driver rather than the system IOMMU, right? Thanks,
>

yes.

Thanks
Kevin