Re: [RFC] /dev/ioasid uAPI proposal

From: David Gibson
Date: Tue Jun 08 2021 - 02:54:53 EST

Next message: Christian König: "Re: linux-next: build failure after merge of the drm-misc tree"
Previous message: David Gibson: "Re: [RFC] /dev/ioasid uAPI proposal"
In reply to: Jason Gunthorpe: "Re: [RFC] /dev/ioasid uAPI proposal"
Next in thread: Paolo Bonzini: "Re: [RFC] /dev/ioasid uAPI proposal"

On Thu, Jun 03, 2021 at 07:17:23AM +0000, Tian, Kevin wrote:
> > From: David Gibson <david@xxxxxxxxxxxxxxxxxxxxx>
> > Sent: Wednesday, June 2, 2021 2:15 PM
> >
> [...]
> > > An I/O address space takes effect in the IOMMU only after it is attached
> > > to a device. The device in the /dev/ioasid context always refers to a
> > > physical one or 'pdev' (PF or VF).
> >
> > What you mean by "physical" device here isn't really clear - VFs
> > aren't really physical devices, and the PF/VF terminology also doesn't
> > extend to non-PCI devices (which I think we want to consider for the
> > API, even if we're not implementing it any time soon).
>
> Yes, it's not very clear, and more in PCI context to simplify the
> description. A "physical" one here means a PCI endpoint function
> which has a unique RID. It's more to differentiate it from the later
> mdev/subdevice, which uses both RID+PASID. Naming is always a hard
> exercise for me... Possibly I'll just use device vs. subdevice in future
> versions.
>
> >
> > Now, it's clear that we can't program things into the IOMMU before
> > attaching a device - we might not even know which IOMMU to use.
>
> yes
>
> > However, I'm not sure if it's wise to automatically make the AS "real"
> > as soon as we attach a device:
> >
> > * If we're going to attach a whole bunch of devices, could we (for at
> > least some IOMMU models) end up doing a lot of work which then has
> > to be re-done for each extra device we attach?
>
> which extra work did you specifically refer to? each attach just implies
> writing the base address of the I/O page table to the IOMMU structure
> corresponding to this device (either being a per-device entry, or per
> device+PASID entry).
>
> and generally device attach should not be in a hot path.
>
> >
> > * With kernel managed IO page tables could attaching a second device
> > (at least on some IOMMU models) require some operation which would
> > require discarding those tables? e.g. if the second device somehow
> > forces a different IO page size
>
> Then the attach should fail and the user should create another IOASID
> for the second device.

Couldn't this make things weirdly order-dependent though? If device A
has strictly more capabilities than device B, then attaching A then B
will be fine, but B then A will trigger a new ioasid fd.
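
To make the order dependence concrete, here is a toy sketch in C.
Everything in it (struct toy_ioas, toy_attach(), the TOY_PGSIZE_* bits)
is made up for illustration and is not part of the proposed uAPI or of
any existing driver. It assumes the kernel fixes the IO page size at
first attach, picking the largest size that device supports. Under that
assumed policy it is attaching the more capable device first that
fails; a different policy could flip which order breaks, but the result
would still depend on attach order.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical page-size capability bits, for illustration only. */
#define TOY_PGSIZE_4K   (1u << 0)
#define TOY_PGSIZE_64K  (1u << 1)

/* Toy model of a kernel-managed I/O address space. */
struct toy_ioas {
	bool     live;    /* set once the first device is attached */
	uint32_t pgsize;  /* page size the page tables were built with */
};

/* Return the highest bit set in mask, or 0 if mask is empty. */
static uint32_t toy_highest_bit(uint32_t mask)
{
	uint32_t bit;

	for (bit = 1u << 31; bit; bit >>= 1)
		if (mask & bit)
			return bit;
	return 0;
}

/*
 * Assumed policy: the first attach makes the address space "real" and
 * fixes its page size to the largest one that device supports. Later
 * attaches fail if the device cannot use the already-chosen size.
 */
static int toy_attach(struct toy_ioas *as, uint32_t dev_pgsizes)
{
	if (dev_pgsizes == 0)
		return -EINVAL;

	if (!as->live) {
		as->pgsize = toy_highest_bit(dev_pgsizes);
		as->live = true;
		return 0;
	}

	/* The tables already exist; the new device must cope with them. */
	if (!(dev_pgsizes & as->pgsize))
		return -EINVAL;

	return 0;
}
```

With device A supporting 4K and 64K and device B supporting only 4K,
attaching A and then B fails on the second call, while attaching B and
then A succeeds both times. An explicit enable step after all attaches
would let the kernel pick a page size common to every attached device
instead.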

> > For that reason I wonder if we want some sort of explicit enable or
> > activate call. Device attaches would only be valid before, map or
> > attach pagetable calls would only be valid after.
>
> I'm interested in learning a real example requiring explicit enable...
>
> >
> > > One I/O address space could be attached to multiple devices. In this case,
> > > /dev/ioasid uAPI applies to all attached devices under the specified IOASID.
> > >
> > > Based on the underlying IOMMU capability one device might be allowed
> > > to attach to multiple I/O address spaces, with DMAs accessing them by
> > > carrying different routing information. One of them is the default I/O
> > > address space routed by PCI Requestor ID (RID) or ARM Stream ID. The
> > > remaining are routed by RID + Process Address Space ID (PASID) or
> > > Stream+Substream ID. For simplicity the following context uses RID and
> > > PASID when talking about the routing information for I/O address spaces.
> >
> > I'm not really clear on how this interacts with nested ioasids. Would
> > you generally expect the RID+PASID IOASes to be children of the base
> > RID IOAS, or not?
>
> No. With Intel SIOV both parent/children could be RID+PASID, e.g.
> when one enables vSVA on a mdev.

Hm, ok. I really haven't understood how the PASIDs fit into this
then. I'll try again on v2.

> > If the PASID ASes are children of the RID AS, can we consider this not
> > as the device explicitly attaching to multiple IOASIDs, but instead
> > attaching to the parent IOASID with awareness of the child ones?
> >
> > > Device attachment is initiated through passthrough framework uAPI (use
> > > VFIO for simplicity in following context). VFIO is responsible for identifying
> > > the routing information and registering it to the ioasid driver when calling
> > > ioasid attach helper function. It could be RID if the assigned device is
> > > pdev (PF/VF) or RID+PASID if the device is mediated (mdev). In addition,
> > > user might also provide its view of virtual routing information (vPASID) in
> > > the attach call, e.g. when multiple user-managed I/O address spaces are
> > > attached to the vfio_device. In this case VFIO must figure out whether
> > > vPASID should be directly used (for pdev) or converted to a kernel-
> > > allocated one (pPASID, for mdev) for physical routing (see section 4).
> > >
> > > Device must be bound to an IOASID FD before attach operation can be
> > > conducted. This is also through VFIO uAPI. In this proposal one device
> > > should not be bound to multiple FD's. Not sure about the gain of
> > > allowing it except adding unnecessary complexity. But if others have
> > > different view we can further discuss.
> > >
> > > VFIO must ensure its device composes DMAs with the routing information
> > > attached to the IOASID. For pdev it naturally happens since vPASID is
> > > directly programmed to the device by guest software. For mdev this
> > > implies any guest operation carrying a vPASID on this device must be
> > > trapped into VFIO and then converted to pPASID before being sent to the
> > > device. A detailed explanation of PASID virtualization policies can be
> > > found in section 4.
> > >
> > > Modern devices may support a scalable workload submission interface
> > > based on PCI DMWr capability, allowing a single work queue to access
> > > multiple I/O address spaces. One example is Intel ENQCMD, having
> > > PASID saved in the CPU MSR and carried in the instruction payload
> > > when sent out to the device. Then a single work queue shared by
> > > multiple processes can compose DMAs carrying different PASIDs.
> >
> > Is the assumption here that the processes share the IOASID FD
> > instance, but not memory?
>
> I didn't get this question

Ok, stepping back, what exactly do you mean by "processes" above? Do
you mean Linux processes, or something else?

> > > When executing ENQCMD in the guest, the CPU MSR includes a vPASID
> > > which, if targeting a mdev, must be converted to pPASID before being sent
> > > to the wire. Intel CPU provides a hardware PASID translation capability
> > > for auto-conversion in the fast path. The user is expected to set up the
> > > PASID mapping through KVM uAPI, with information about {vpasid,
> > > ioasid_fd, ioasid}. The ioasid driver provides helper function for KVM
> > > to figure out the actual pPASID given an IOASID.
> > >
> > > With above design /dev/ioasid uAPI is all about I/O address spaces.
> > > It doesn't include any device routing information, which is only
> > > indirectly registered to the ioasid driver through VFIO uAPI. For
> > > example, I/O page fault is always reported to userspace per IOASID,
> > > although it's physically reported per device (RID+PASID). If there is a
> > > need of further relaying this fault into the guest, the user is responsible
> > > for identifying the device attached to this IOASID (randomly picking one if
> > > multiple devices are attached) and then generating a per-device virtual I/O
> > > page fault into the guest. Similarly the iotlb invalidation uAPI describes the
> > > granularity in the I/O address space (all, or a range), different from the
> > > underlying IOMMU semantics (domain-wide, PASID-wide, range-based).
> > >
> > > I/O page tables routed through PASID are installed in a per-RID PASID
> > > table structure. Some platforms implement the PASID table in the guest
> > > physical space (GPA), expecting it managed by the guest. The guest
> > > PASID table is bound to the IOMMU also by attaching to an IOASID,
> > > representing the per-RID vPASID space.
> >
> > Do we need to consider two management modes here, much as we have for
> > the pagetables themselves: either kernel managed, in which we have
> > explicit calls to bind a vPASID to a parent PASID, or user managed in
> > which case we register a table in some format?
>
> yes, this is related to PASID virtualization in section 4. And based on
> suggestion from Jason, the vPASID requirement will be reported to
> user space via the per-device reporting interface.
>
> Thanks
> Kevin
>

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Attachment: signature.asc
Description: PGP signature
