RE: [RFC] /dev/ioasid uAPI proposal

From: Tian, Kevin
Date: Tue Jun 01 2021 - 03:02:58 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Saturday, May 29, 2021 4:03 AM
>
> On Thu, May 27, 2021 at 07:58:12AM +0000, Tian, Kevin wrote:
> > /dev/ioasid provides a unified interface for managing I/O page tables for
> > devices assigned to userspace. Device passthrough frameworks (VFIO, vDPA,
> > etc.) are expected to use this interface instead of creating their own logic to
> > isolate untrusted device DMAs initiated by userspace.
>
> It is very long, but I think this has turned out quite well. It
> certainly matches the basic sketch I had in my head when we were
> talking about how to create vDPA devices a few years ago.
>
> When you get down to the operations they all seem pretty common sense
> and straightforward. Create an IOASID. Connect to a device. Fill the
> IOASID with pages somehow. Worry about PASID labeling.
>
> It really is critical to get all the vendor IOMMU people to go over it
> and see how their HW features map into this.
>

Agreed. BTW, I think it would be good to have several design opens
discussed centrally after going through all the comments. Otherwise
they may get buried in different sub-threads and receive insufficient
attention (especially from people who haven't finished reading the
whole thread).

I have summarized five opens here:

1) Finalizing the name to replace /dev/ioasid;
2) Whether one device is allowed to bind to multiple IOASID fd's;
3) Whether to carry device information in the invalidation/fault
   reporting uAPI;
4) What should/could be specified when allocating an IOASID;
5) The protocol between vfio group and kvm.

For 1), two alternative names have been mentioned: /dev/iommu and
/dev/ioas. I don't have a strong preference and would like to hear
votes from all stakeholders. /dev/iommu is slightly better imho for
two reasons. First, per AMD's presentation at the last KVM Forum, they
implement vIOMMU in hardware and thus need to support user-managed
domains, so an iommu uAPI notation might make more sense moving
forward. Second, it makes later uAPI naming easier, as 'IOASID' can
always be used as the object, e.g. IOMMU_ALLOC_IOASID instead of
IOASID_ALLOC_IOASID. :)

Another naming open is about IOASID (the software handle for an ioas)
and the associated hardware ID (PASID or substream ID). Jason thinks
PASID is defined more from the SVA angle, while ARM's convention sounds
clearer from the device p.o.v. Following this direction, SID/SSID would
be used to replace RID/PASID in this RFC (possibly also implying that
the kernel IOASID allocator should be renamed to SSID allocator).
I don't have a better alternative. If no one objects, I'll change to
this new naming in the next version.

For 2), Jason prefers not to block it unless there is a kernel design
reason. If one device is allowed to bind to multiple IOASID fd's, the
main problem is cross-fd IOASID nesting, e.g. having gpa_ioasid created
in fd1 and giova_ioasid created in fd2 and then nesting them together
(and whether any cross-fd notification is required when handling
invalidation etc.). We thought this just adds complexity, while we are
not sure about the value of supporting it (when one fd can already
cover all discussed usages). Therefore this RFC proposes that a device
is bound to at most one IOASID fd. Does this rationale make sense?

At the other extreme, there was also the idea of allowing only a
single I/O address space per IOASID fd. This was discussed in a
previous thread, where it was noted that the number of fd's is
insufficient to cover the theoretical 1M address spaces per device.
But let's revisit it and draw a clear conclusion on whether this
option is viable.
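
To make the restriction proposed for 2) concrete, here is a minimal
sketch of the two checks it implies: a device binds to at most one fd,
and nesting is accepted only between IOASIDs created in the same fd.
All struct and function names below are hypothetical and are not part
of the RFC; this is an illustration of the rule, not a proposed
implementation.

```c
#include <stddef.h>

/* All names here are hypothetical, for illustration only. */
struct ioasid_fd { int id; };

struct device {
    struct ioasid_fd *bound;   /* NULL while the device is unbound */
};

struct ioasid {
    struct ioasid_fd *owner;   /* fd in which this IOASID was created */
    struct ioasid *parent;     /* NULL unless nested on another IOASID */
};

/* Bind succeeds only if the device is unbound or already bound to @fd. */
int device_bind(struct device *dev, struct ioasid_fd *fd)
{
    if (dev->bound && dev->bound != fd)
        return -1;             /* already bound to a different fd */
    dev->bound = fd;
    return 0;
}

/* Nesting is accepted only when both IOASIDs live in the same fd. */
int ioasid_nest(struct ioasid *child, struct ioasid *parent)
{
    if (child->owner != parent->owner)
        return -1;             /* cross-fd nesting is rejected */
    child->parent = parent;
    return 0;
}
```

With these checks the gpa_ioasid/giova_ioasid example never needs any
cross-fd notification, because both objects always share one owner.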

For 3), Jason and Jean both think it's cleaner to carry device info in
the uAPI. This was actually one option we developed in earlier internal
versions of this RFC. Later we changed it to the current approach based
on a misinterpretation of a previous discussion. After more thought, we
will adopt this suggestion in the next version, for both efficiency
(the I/O page fault path is already long) and security reasons (some
faults are unrecoverable, so the faulting device must be identified
and isolated).

This implies that VFIO_BOUND_IOASID will be extended to allow the user
to specify a device label. This label will be recorded in /dev/iommu to
serve per-device invalidation requests from the user and to report
per-device fault data to the user. In addition, the vPASID (if provided
by the user) will also be recorded in /dev/iommu so that
vPASID<->pPASID conversion is done properly. E.g. an invalidation
request from the user carries a vPASID, which must be converted into a
pPASID before calling the iommu driver. The reverse applies to raw
fault data, which carries a pPASID while the user expects a vPASID.
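
As a rough illustration of the bookkeeping described above, the sketch
below records a (device label, vPASID, pPASID) tuple at bind time and
looks it up in both directions. The table layout, the fixed size and
every name are assumptions made for this example only. It also assumes
the fault path has already resolved the faulting device to its label.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and layout; nothing here is defined by the RFC. */
#define MAX_BINDINGS 16

struct pasid_binding {
    uint64_t dev_label;   /* label supplied by the user at bind time */
    uint32_t vpasid;      /* PASID as seen by the user/guest */
    uint32_t ppasid;      /* PASID actually programmed in hardware */
    int in_use;
};

struct ioasid_ctx {
    struct pasid_binding tbl[MAX_BINDINGS];
};

/* Record one binding; returns -1 when the table is full. */
int ctx_record(struct ioasid_ctx *ctx, uint64_t label,
               uint32_t vpasid, uint32_t ppasid)
{
    for (size_t i = 0; i < MAX_BINDINGS; i++) {
        if (!ctx->tbl[i].in_use) {
            ctx->tbl[i] = (struct pasid_binding){ label, vpasid, ppasid, 1 };
            return 0;
        }
    }
    return -1;
}

/* Invalidation path: user-provided vPASID -> pPASID for the iommu driver. */
int ctx_v2p(const struct ioasid_ctx *ctx, uint64_t label,
            uint32_t vpasid, uint32_t *ppasid)
{
    for (size_t i = 0; i < MAX_BINDINGS; i++) {
        const struct pasid_binding *b = &ctx->tbl[i];
        if (b->in_use && b->dev_label == label && b->vpasid == vpasid) {
            *ppasid = b->ppasid;
            return 0;
        }
    }
    return -1;
}

/* Fault path: pPASID from raw fault data -> vPASID reported to the user. */
int ctx_p2v(const struct ioasid_ctx *ctx, uint64_t label,
            uint32_t ppasid, uint32_t *vpasid)
{
    for (size_t i = 0; i < MAX_BINDINGS; i++) {
        const struct pasid_binding *b = &ctx->tbl[i];
        if (b->in_use && b->dev_label == label && b->ppasid == ppasid) {
            *vpasid = b->vpasid;
            return 0;
        }
    }
    return -1;
}
```

A lookup that fails in either direction would be treated as an error
(e.g. an invalidation for a vPASID the user never registered).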

For 4), there are two options for specifying the IOASID attributes:

In this RFC, an IOASID has no attributes before it's attached to any
device. After device attach, the user queries capability/format info
about the IOMMU which the device belongs to, and then calls
different ioctl commands to set the attributes for the IOASID (e.g.
map/unmap, bind/unbind user pgtable, nesting, etc.). This follows
how the underlying iommu-layer API is designed: a domain reports
capability/format info and serves iommu ops only after it's attached
to a device.

Jason suggests having the user specify all attributes about how an
IOASID is expected to work when creating the IOASID. This requires
/dev/iommu to provide capability/format info once a device is bound
to the ioasid fd (before creating any IOASID). In concept this should
work, since given a device we can always find its IOMMU. The only gap
is the one mentioned above: the current iommu API is designed
per-domain instead of per-device.
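
The difference between the two options is mainly one of ordering, which
the toy model below tries to show. It is only a sketch: the
capability fields, the format bitmask and all function names are
invented for illustration and do not correspond to any existing or
proposed ioctl.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; the real capability/format layout is not defined here. */
struct iommu_caps {
    uint32_t pgtable_formats;      /* bitmask of supported formats */
    bool nesting;
};

struct device {
    struct iommu_caps caps;        /* caps of the IOMMU behind this device */
    bool bound_to_fd;
};

struct ioasid {
    const struct device *attached; /* NULL until a device is attached */
    uint32_t format;               /* 0 = no attribute set yet */
};

/* Current RFC: info is reachable only through an attached IOASID. */
int ioasid_query_caps(const struct ioasid *as, struct iommu_caps *out)
{
    if (!as->attached)
        return -1;
    *out = as->attached->caps;
    return 0;
}

/* Jason's suggestion: info is reachable once the device is bound to the fd. */
int device_query_caps(const struct device *dev, struct iommu_caps *out)
{
    if (!dev->bound_to_fd)
        return -1;
    *out = dev->caps;
    return 0;
}

/* Jason's suggestion: attributes are fixed when the IOASID is created. */
int ioasid_alloc(struct ioasid *as, const struct iommu_caps *caps,
                 uint32_t format)
{
    if (!(caps->pgtable_formats & format))
        return -1;                 /* format not supported by this IOMMU */
    as->attached = NULL;
    as->format = format;
    return 0;
}
```

In the first flow the query cannot succeed before attach; in the second
the per-device query has to work before any IOASID (domain) exists,
which is exactly the per-domain vs. per-device gap noted above.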

It seems that closing this design open requires touching the kAPI
design, and Joerg's input would be highly appreciated here.

For 5), I'd expect Alex to chime in. Per my understanding, the
original purpose of this protocol is not about I/O address spaces. It's
for KVM to know whether any device is assigned to this VM and then
do something special (e.g. posted interrupts, EPT cache attributes,
etc.). Because KVM deduces some policy from the fact that a device is
assigned, it needs to hold a reference to the related vfio group. This
part is irrelevant to this RFC.

But ARM's VMID usage is related to the I/O address space and thus needs
some consideration. Another strange thing is PPC. It looks like it also
leverages this protocol to do iommu group attach:
kvm_spapr_tce_attach_iommu_group. I don't know why it's done through
KVM instead of the VFIO uAPI in the first place.

Thanks
Kevin