RE: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

From: Tian, Kevin
Date: Thu Apr 22 2021 - 04:35:08 EST

> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Thursday, April 22, 2021 7:03 AM
> > The pseudo code above really suggests you do want to remove
> > /dev/vfio/vfio, but this is only one of the IOMMU backends for vfio, so
> > I can't quite figure out if we're talking past each other.
> I'm not quite sure what you mean by "one of the IOMMU backends?" You
> mean type1, right?

I think Alex meant that type1 is one of the IOMMU backends in VFIO (type1,
type1v2, tce, tce_v2, noiommu, etc.) which are all configured through
/dev/vfio/vfio. If we are just moving type1 to /dev/ioasid, the justification
is not sufficient by replacing /dev/vfio/vfio with /dev/ioasid, at least in
this transition phase (before all iommu bits are consolidated in /dev/ioasid
in your ideal outcome).

> > As I expressed in another thread, type1 has a lot of shortcomings. The
> > mapping interface leaves userspace trying desperately to use statically
> > mapped buffers because the map/unmap latency is too high. We have
> > horrible issues with duplicate locked page accounting across
> > containers. It suffers pretty hard from feature creep in various
> > areas. A new IOMMU backend is an opportunity to redesign some of these
> > things.
> Sure, but also those kinds of transformational things go alot better
> if you can smoothly go from the old to the new and have technical
> co-existance in side the kernel. Having a shim that maps the old APIs
> to new APIs internally to Linux helps keep the implementation from
> becoming too bogged down with compatibility.

The shim layer could be considered as a new iommu backend in VFIO,
which connects VFIO iommu ops to the internal helpers in drivers/ioasid.
In this case then we don't need to replicate the VFIO uAPI through
/dev/ioasid. Instead the new interface just supports new uAPI. An old
VFIO userspace still opens /dev/vfio/vfio to conduct iommu operations
which implicitly goes to drivers/ioasid. A new VFIO userspace uses
/dev/vfio/vfio to join ioasid_fd and then use new uAPIs through /dev/
ioasid to manage iommu pgtables, as you described below.

> > The IOMMU group also abstracts isolation and visibility relative to
> > DMA. For example, in a PCIe topology a multi-function device may not
> > have isolation between functions, but each requester ID is visible to
> > the IOMMU.
> Okay, I'm glad I have this all right in my head, as I was pretty sure
> this was what the group was about.
> My next question is why do we have three things as a FD: group, device
> and container (aka IOMMU interface)?
> Do we have container because the /dev/vfio/vfio can hold only a single
> page table so we need to swap containers sometimes?

Yes, one container can hold only a single page table. When vIOMMU is
exposed, VFIO requires each device/group in different containers to
support per-device address space (before nested translation is supported),
which is switched between GPA and gIOVA when bypass mode is turned
on/off for a given device.

Another tricky thing is that a container may be linked to multiple iommu
domains in VFIO, as devices in the container may locate behind different
IOMMUs with inconsistent capability (commit 1ef3e2bc). In this case
more accurately one container can hold a single address space, which could
be replayed into multiple page tables (with exact same mappings). I'm not
sure whether this is something that could be simplified (or not supported)
in the new interface. In the end each pgtable operation is per iommu domain
in the iommu layer. I wonder where we want to maintain the relationship
between the ioasid_fd and associated iommu domains.

> If we start from a clean sheet and make a sketch..
> /dev/ioasid is the IOMMU control interface. It can create multiple
> IOASIDs that have page tables and it can manipulate those page tables.
> Each IOASID is identified by some number.
> struct vfio_device/vdpa_device/etc are consumers of /dev/ioasid
> When a device attaches to an ioasid userspace gives VFIO/VDPA the
> ioasid FD and the ioasid # in the FD.
> The security rule for isolation is that once a device is attached to a
> /dev/ioasid fd then all other devices in that security group must be
> attached to the same ioasid FD or left unused.
> Thus /dev/ioasid also becomes the unit of security and the IOMMU
> subsystem level becomes aware of and enforces the group security
> rules. Userspace does not need to "see" the group
> In sketch it would be like
> ioasid_fd = open("/dev/ioasid");
> vfio_device_fd = open("/dev/vfio/device0")
> vdpa_device_fd = open("/dev/vdpa/device0")
> ioctl(vifo_device_fd, JOIN_IOASID_FD, ioasifd)
> ioctl(vdpa_device_fd, JOIN_IOASID_FD, ioasifd)
> gpa_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
> ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> ioctl(vfio_device, ATTACH_IOASID, gpa_ioasid_id)
> ioctl(vpda_device, ATTACH_IOASID, gpa_ioasid_id)
> .. both VDPA and VFIO see the guest physical map and the kernel has
> enough info that both could use the same IOMMU page table
> structure ..
> // Guest viommu turns off bypass mode for the vfio device
> ioctl(vfio_device, DETATCH_IOASID)
> // Guest viommu creates a new page table
> rid_ioasid_id = ioctl(ioasid_fd, CREATE_IOASID, ..)
> ioctl(ioasid_fd, SET_IOASID_PAGE_TABLES, ..)
> // Guest viommu links the new page table to the RID
> ioctl(vfio_device, ATTACH_IOASID, rid_ioasid_id)

Just to confirm. Above flow is for current map/unmap flavor as what
VFIO/vDPA do today. Later when nested translation is supported,
there is no need to detach gpa_ioasid_fd. Instead, a new cmd will
be introduced to nest rid_ioasid_fd on top of gpa_ioasid_fd:

ioctl(ioasid_fd, NEST_IOASIDS, rid_ioasid_id, gpa_ioasid_id);
ioctl(ioasid_fd, BIND_PGTABLE, rid_ioasid_id, ...);

and vSVA will follow the same flow:

ioctl(ioasid_fd, NEST_IOASIDS, sva_ioasid_id, gpa_ioasid_id);
ioctl(ioasid_fd, BIND_PGTABLE, sva_ioasid_id, ...);

Does it match your mind when expanding /dev/ioasid to support
vSVA and other new usages?

> The group security concept becomes implicit and hidden from the
> uAPI. JOIN_IOASID_FD implicitly finds the device's group inside the
> kernel and requires that all members of the group be joined only to
> this ioasid_fd.
> Essentially we discover the group from the device instead of the
> device from the group.
> Where does it fall down compared to the three FD version we have
> today?

I also feel hiding group from uAPI is a good thing and is interested in
the rationale behind for explicitly managing group in vfio (which is
essentially the same boundary as provided by iommu group), e.g. for
better user experience when group security is broken?