RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Thu Jun 03 2021 - 03:17:31 EST

> From: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> Sent: Wednesday, June 2, 2021 2:15 PM
> > An I/O address space takes effect in the IOMMU only after it is attached
> > to a device. The device in the /dev/ioasid context always refers to a
> > physical one or 'pdev' (PF or VF).
> What you mean by "physical" device here isn't really clear - VFs
> aren't really physical devices, and the PF/VF terminology also doesn't
> extend to non-PCI devices (which I think we want to consider for the
> API, even if we're not implementing it any time soon).

Yes, it's not very clear, and it is phrased in PCI terms to simplify the
description. A "physical" device here means a PCI endpoint function
which has a unique RID. The point is to differentiate it from the later
mdev/subdevice, which uses both RID+PASID. Naming is always a hard
exercise for me... Possibly I'll just use device vs. subdevice in the future.

> Now, it's clear that we can't program things into the IOMMU before
> attaching a device - we might not even know which IOMMU to use.


> However, I'm not sure if it's wise to automatically make the AS "real"
> as soon as we attach a device:
> * If we're going to attach a whole bunch of devices, could we (for at
> least some IOMMU models) end up doing a lot of work which then has
> to be re-done for each extra device we attach?

Which extra work did you specifically refer to? Each attach just implies
writing the base address of the I/O page table to the IOMMU structure
corresponding to this device (either a per-device entry or a per
device+PASID entry).

And generally device attach should not be in a hot path.
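
To make the cost concrete, here is a minimal sketch of what attach amounts
to in this model. It is a plain C simulation, not kernel code: struct
dev_entry, struct ioas, ioas_attach(), MAX_PASID and NO_PASID are all
hypothetical names made up for illustration and are not part of any
proposed uAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PASID 4     /* hypothetical, kept tiny for illustration */
#define NO_PASID  (-1)  /* hypothetical marker: route by RID only */

/* Hypothetical model of the IOMMU structure for one device (RID). */
struct dev_entry {
    uint64_t rid_pgtbl;               /* base used for RID-routed DMA */
    uint64_t pasid_pgtbl[MAX_PASID];  /* bases used for RID+PASID-routed DMA */
};

/* Hypothetical model of an I/O address space. */
struct ioas {
    uint64_t pgtbl_base;  /* base address of the I/O page table */
};

/*
 * Attach: write the page table base into the entry selected by the
 * routing information. Returns 0 on success, -1 on an invalid PASID.
 */
static int ioas_attach(const struct ioas *as, struct dev_entry *dev, int pasid)
{
    assert(as != NULL && dev != NULL);

    if (pasid == NO_PASID) {
        dev->rid_pgtbl = as->pgtbl_base;      /* per-device entry */
        return 0;
    }
    if (pasid < 0 || pasid >= MAX_PASID)
        return -1;
    dev->pasid_pgtbl[pasid] = as->pgtbl_base; /* per-device+PASID entry */
    return 0;
}
```

In this sketch, attaching N devices to one address space is N independent
single-entry writes; nothing done for an earlier device is redone.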

> * With kernel managed IO page tables could attaching a second device
> (at least on some IOMMU models) require some operation which would
> require discarding those tables? e.g. if the second device somehow
> forces a different IO page size

Then the attach should fail and the user should create another IOASID
for the second device.
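
The expected user-side handling could be sketched roughly as below. Again
this is only a self-contained C simulation under stated assumptions: the
names ioasid_alloc(), ioasid_attach() and attach_or_new(), the ioas_table
array, and the use of I/O page size as the only compatibility criterion
are hypothetical and do not reflect the actual uAPI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_IOAS 8  /* hypothetical limit, for illustration only */

/* Hypothetical model of a kernel-managed I/O address space. */
struct ioas {
    int in_use;
    int ndevs;        /* number of attached devices */
    uint32_t pgsize;  /* I/O page size, fixed by the first attach */
};

/* Hypothetical device: only the property relevant here. */
struct device {
    uint32_t pgsize;  /* I/O page size this device forces */
};

static struct ioas ioas_table[MAX_IOAS];

/* Allocate a free IOASID, or return -1 if none is left. */
static int ioasid_alloc(void)
{
    for (int i = 0; i < MAX_IOAS; i++) {
        if (!ioas_table[i].in_use) {
            ioas_table[i] = (struct ioas){ .in_use = 1 };
            return i;
        }
    }
    return -1;
}

/* Attach fails with -1 instead of discarding existing page tables. */
static int ioasid_attach(int ioasid, const struct device *dev)
{
    assert(dev != NULL);
    if (ioasid < 0 || ioasid >= MAX_IOAS || !ioas_table[ioasid].in_use)
        return -1;

    struct ioas *as = &ioas_table[ioasid];
    if (as->ndevs > 0 && as->pgsize != dev->pgsize)
        return -1;  /* incompatible with what is already attached */

    as->pgsize = dev->pgsize;
    as->ndevs++;
    return 0;
}

/*
 * User-side policy: try the shared IOASID first and, if the attach is
 * rejected, create another IOASID for this device.
 * Returns the IOASID actually used, or -1 on failure.
 */
static int attach_or_new(int ioasid, const struct device *dev)
{
    if (ioasid_attach(ioasid, dev) == 0)
        return ioasid;

    int fresh = ioasid_alloc();
    if (fresh < 0 || ioasid_attach(fresh, dev) != 0)
        return -1;
    return fresh;
}
```

The point of the sketch is that the existing tables are never torn down:
an incompatible device simply ends up in its own address space.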

> For that reason I wonder if we want some sort of explicit enable or
> activate call. Device attaches would only be valid before, map or
> attach pagetable calls would only be valid after.

I'd be interested in seeing a real example that requires an explicit enable...

> > One I/O address space could be attached to multiple devices. In this case,
> > /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> >
> > Based on the underlying IOMMU capability one device might be allowed
> > to attach to multiple I/O address spaces, with DMAs accessing them by
> > carrying different routing information. One of them is the default I/O
> > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > remaining are routed by RID + Process Address Space ID (PASID) or
> > Stream+Substream ID. For simplicity the following context uses RID and
> > PASID when talking about the routing information for I/O address spaces.
> I'm not really clear on how this interacts with nested ioasids. Would
> you generally expect the RID+PASID IOASes to be children of the base
> RID IOAS, or not?

No. With Intel SIOV both parent and children could be RID+PASID, e.g.
when one enables vSVA on an mdev.

> If the PASID ASes are children of the RID AS, can we consider this not
> as the device explicitly attaching to multiple IOASIDs, but instead
> attaching to the parent IOASID with awareness of the child ones?
> > Device attachment is initiated through passthrough framework uAPI (use
> > VFIO for simplicity in following context). VFIO is responsible for identifying
> > the routing information and registering it to the ioasid driver when calling
> > ioasid attach helper function. It could be RID if the assigned device is
> > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > user might also provide its view of virtual routing information (vPASID) in
> > the attach call, e.g. when multiple user-managed I/O address spaces are
> > attached to the vfio_device. In this case VFIO must figure out whether
> > vPASID should be directly used (for pdev) or converted to a kernel-
> > allocated one (pPASID, for mdev) for physical routing (see section 4).
> >
> > Device must be bound to an IOASID FD before attach operation can be
> > conducted. This is also through VFIO uAPI. In this proposal one device
> > should not be bound to multiple FD's. Not sure about the gain of
> > allowing it except adding unnecessary complexity. But if others have
> > different view we can further discuss.
> >
> > VFIO must ensure its device composes DMAs with the routing information
> > attached to the IOASID. For pdev it naturally happens since vPASID is
> > directly programmed to the device by guest software. For mdev this
> > implies any guest operation carrying a vPASID on this device must be
> > trapped into VFIO and then converted to pPASID before sent to the
> > device. A detailed explanation about PASID virtualization policies can be
> > found in section 4.
> >
> > Modern devices may support a scalable workload submission interface
> > based on PCI DMWr capability, allowing a single work queue to access
> > multiple I/O address spaces. One example is Intel ENQCMD, having
> > PASID saved in the CPU MSR and carried in the instruction payload
> > when sent out to the device. Then a single work queue shared by
> > multiple processes can compose DMAs carrying different PASIDs.
> Is the assumption here that the processes share the IOASID FD
> instance, but not memory?

I didn't get this question.

> > When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> > which, if targeting a mdev, must be converted to pPASID before sent
> > to the wire. Intel CPU provides a hardware PASID translation capability
> > for auto-conversion in the fast path. The user is expected to setup the
> > PASID mapping through KVM uAPI, with information about {vpasid,
> > ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> > to figure out the actual pPASID given an IOASID.
> >
> > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > It doesn't include any device routing information, which is only
> > indirectly registered to the ioasid driver through VFIO uAPI. For
> > example, I/O page fault is always reported to userspace per IOASID,
> > although it's physically reported per device (RID+PASID). If there is a
> > need of further relaying this fault into the guest, the user is responsible
> > of identifying the device attached to this IOASID (randomly pick one if
> > multiple attached devices) and then generates a per-device virtual I/O
> > page fault into guest. Similarly the iotlb invalidation uAPI describes the
> > granularity in the I/O address space (all, or a range), different from the
> > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> >
> > I/O page tables routed through PASID are installed in a per-RID PASID
> > table structure. Some platforms implement the PASID table in the guest
> > physical space (GPA), expecting it managed by the guest. The guest
> > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > representing the per-RID vPASID space.
> Do we need to consider two management modes here, much as we have for
> the pagetables themselves: either kernel managed, in which we have
> explicit calls to bind a vPASID to a parent PASID, or user managed in
> which case we register a table in some format.

Yes, this is related to PASID virtualization in section 4. And based on
a suggestion from Jason, the vPASID requirement will be reported to
user space via the per-device reporting interface.