RE: Plan for /dev/ioasid RFC v2
From: Tian, Kevin
Date: Mon Jun 28 2021 - 02:45:35 EST
> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Friday, June 25, 2021 10:36 PM
>
> On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
>
> > - When receiving the binding call for the 1st device in a group, iommu_fd
> > calls iommu_group_set_block_dma(group, dev->driver) which does
> > several things:
>
> The whole problem here is trying to match this new world where we want
> devices to be in charge of their own IOMMU configuration and the old
> world where groups are in charge.
>
> Inserting the group fd and then calling a device-centric
> VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> necessary. We can always get the group back from the device at any
> point in the sequence do to a group wide operation.
>
> What I saw as the appeal of the sort of idea was to just completely
> leave all the difficult multi-device-group scenarios behind on the old
> group centric API and then we don't have to deal with them at all, or
> least not right away.
>
> I'd see some progression where iommu_fd only works with 1:1 groups at
> the start. Other scenarios continue with the old API.
>
> Then maybe groups where all devices use the same IOASID.
>
> Then 1:N groups if the source device is reliably identifiable, this
> requires iommu subystem work to attach domains to sub-group objects -
> not sure it is worthwhile.
>
> But at least we can talk about each step with well thought out patches
>
> The only thing that needs to be done to get the 1:1 step is to broadly
> define how the other two cases will work so we don't get into trouble
> and set some way to exclude the problematic cases from even getting to
> iommu_fd in the first place.
>
> For instance if we go ahead and create /dev/vfio/device nodes we could
> do this only if the group was 1:1, otherwise the group cdev has to be
> used, along with its API.
>
Thinking more along your direction, here is an updated sketch:
[Stage-1]
Multi-device groups (1:N) are handled by the existing vfio group fd and
the vfio_iommu_type1 driver.
Singleton groups (1:1) are handled via a new device-centric protocol:
1) /dev/vfio/device nodes are created for devices in a singleton group
or devices w/o a group (mdev)
2) user gets iommu_fd by open("/dev/iommu"). A default block_dma
domain is created per iommu_fd (or globally) with an empty I/O
page table.
3) iommu_fd reports that only 1:1 group is supported
4) user gets device_fd by open("/dev/vfio/device"). At this point
mmap() should be blocked since a security context hasn't been
established for this fd. This could be done by returning an error
(EACCES or EAGAIN?), or by succeeding w/o actually setting up the
mapping.
5) user requests to bind device_fd to iommu_fd, which verifies that the
group is not 1:N (for mdev the check is on the parent device).
Successful binding automatically attaches the device to the block_dma
domain via iommu_attach_group(). From now on the user is
permitted to access the device. If mmap() in 4) is allowed, vfio
actually sets up the MMIO mapping at this point.
6) before the device is unbound from iommu_fd, it is always in a
security context. Attaching/detaching just switches the security
context between the block_dma domain and an ioasid domain.
7) Unbinding detaches the device from the block_dma domain and
re-attaches it to the default domain. From now on the user should
be denied access to the device. vfio should tear down the
MMIO mapping at this point.
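
To make the stage-1 state transitions easier to review, here is a minimal user-space sketch in Python. It is only a toy model of steps 4) to 7) above, not kernel code; the class and method names (Stage1Device, bind, attach_ioasid, ...) are made up for illustration, and it picks the "return an error" variant for mmap() before binding.

```python
from enum import Enum, auto


class Domain(Enum):
    DEFAULT = auto()    # kernel default domain, not safe for user access
    BLOCK_DMA = auto()  # empty I/O page table, all DMA blocked
    IOASID = auto()     # user-managed I/O address space


class Stage1Device:
    """Toy model of one device_fd; all names here are hypothetical."""

    def __init__(self, group_size=1):
        self.group_size = group_size
        self.domain = Domain.DEFAULT
        self.bound = False
        self.mmio_mapped = False

    def mmap(self):
        # Step 4: no security context before binding, so refuse.
        if not self.bound:
            raise PermissionError("device_fd is not bound to an iommu_fd")
        self.mmio_mapped = True

    def bind(self):
        # Step 5: stage-1 only accepts singleton groups.
        if self.group_size != 1:
            raise PermissionError("1:N group must use the legacy group fd")
        self.bound = True
        self.domain = Domain.BLOCK_DMA

    def attach_ioasid(self):
        # Step 6: switch between security contexts while bound.
        if not self.bound:
            raise PermissionError("bind before attach")
        self.domain = Domain.IOASID

    def detach_ioasid(self):
        if self.bound:
            self.domain = Domain.BLOCK_DMA

    def unbind(self):
        # Step 7: zap MMIO and fall back to the default domain.
        self.mmio_mapped = False
        self.bound = False
        self.domain = Domain.DEFAULT
```

The point of the model is the invariant in 6): between bind() and unbind() the domain is never DEFAULT.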
[Stage-2]
Both 1:1 and 1:N groups are handled via the new device-centric protocol.
Old vfio uAPI is kept for legacy applications. All devices in the same group
must share the same I/O address space.
A key difference from stage-1 is the additional check on group viability:
1) vfio creates /dev/vfio/device nodes for all devices
2) Same as stage-1 for getting iommu_fd
3) iommu_fd reports that both 1:1 and 1:N groups are supported
4) Same as stage-1 for getting device_fd
5) When receiving the binding call for the 1st device in a group, iommu_fd
does several things:
a) Identify the group of this device and check group viability. A group
is viable only when all devices in the group are in one of the states below:
* driver-less
* bound to a driver which is same as the one which does the
binding call (vfio in this case)
* bound to an otherwise allowed driver (which indicates that it
is safe for iommu_fd usage around probe())
b) Attach all devices in the group to the block_dma domain, via existing
iommu_attach_group().
c) Register a notifier callback to verify group viability on the
IOMMU_GROUP_NOTIFY_BOUND_DRIVER event. BUG_ON() might be eliminated if
we can find a way to deny probe of non-iommu-safe drivers.
From now on the user is permitted to access the device. Similar to
stage-1, vfio may set up the MMIO mapping at this point.
6) Binding other devices in the same group simply succeeds.
7) Before the last device in the group is unbound from iommu_fd, all
devices in the group (even those not bound to iommu_fd) switch together
between the block_dma domain and an ioasid domain, initiated by attaching
to or detaching from an ioasid.
a) iommu_fd verifies that all bound devices in the same group are
attached to a single IOASID.
b) the 1st device attach in the group moves the entire group to use
the new IOASID domain.
c) the last device detach moves the entire group back to the block_dma
domain.
8) A device is allowed to be unbound from iommu_fd when other devices
in the group are still bound. In this case all devices in this group are still
attached to a security context (block_dma or ioasid). vfio may still zap
the MMIO mapping (though the device is still in a security context) since
it is not group-aware in this new flow. The unbound device should not be
bound to another driver which could break the group viability.
9) When the user requests to unbind the last device in the group, iommu_fd
detaches the whole group from the block_dma domain. All MMIO mappings
must be zapped immediately. Devices in the group are re-attached to
the default domain from now on (not safe for user to access).
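
The group-level rules in 5) to 9) can be sketched the same way. Again this is only a toy Python model under stated assumptions: the Group class, its method names and the string domain labels are hypothetical, the "allowed" set stands in for both vfio itself and any otherwise allowed driver, and the notifier in 5c) is not modelled.

```python
BLOCK_DMA = "block_dma"
DEFAULT = "default"


class Group:
    """Toy model of one IOMMU group behind iommu_fd (names are hypothetical)."""

    def __init__(self, drivers, allowed=("vfio",)):
        # drivers: device name -> bound driver name, or None if driver-less
        self.drivers = dict(drivers)
        self.allowed = set(allowed)
        self.bound = set()      # devices bound to iommu_fd
        self.attached = set()   # bound devices attached to the IOASID
        self.domain = DEFAULT   # one domain for the whole group

    def viable(self):
        # 5a: each device is driver-less or bound to an allowed driver.
        return all(d is None or d in self.allowed
                   for d in self.drivers.values())

    def bind(self, dev):
        if not self.bound:
            if not self.viable():
                raise PermissionError("group is not viable")
            self.domain = BLOCK_DMA      # 5b: whole group moves together
        self.bound.add(dev)              # 6: later binds just succeed

    def attach(self, dev, ioasid):
        if dev not in self.bound:
            raise PermissionError("bind before attach")
        # 7a: all attached devices must share a single IOASID.
        if self.attached and self.domain != ioasid:
            raise ValueError("group already uses a different IOASID")
        self.attached.add(dev)
        self.domain = ioasid             # 7b: 1st attach moves the group

    def detach(self, dev):
        self.attached.discard(dev)
        if self.bound and not self.attached:
            self.domain = BLOCK_DMA      # 7c: last detach moves it back

    def unbind(self, dev):
        if dev not in self.bound:
            raise KeyError(dev)
        self.detach(dev)
        self.bound.remove(dev)           # 8: others keep the group secured
        if not self.bound:
            self.domain = DEFAULT        # 9: last unbind releases the group
```

With this model the group domain is DEFAULT only while no device in the group is bound, which is the property 8) and 9) are trying to guarantee.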
[Stage-3]
It's still an open question whether we want to further allow devices within
a group to be attached to different IOASIDs when the source devices are
reliably identifiable. This is a usage not supported by existing vfio and
might not be worthwhile given improved isolation over time.
If it is required, the iommu layer has to create sub-group objects and
expose the sub-group topology to userspace. At the same time, the iommu
API will be extended to allow sub-group attach/detach operations.
In this case, there is not much difference from the stage-2 flow. iommu_fd
just needs to understand the sub-group topology when allowing a group of
devices to be attached to different IOASIDs. The key is still to enforce that
the entire group is in iommu_fd managed security contexts (block_dma or
ioasid) as long as one or more devices in the group are still bound to it.
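
For completeness, a very rough Python sketch of what the stage-3 invariant could look like if sub-groups ever exist. Everything here is an assumption: no sub-group object exists in the iommu layer today, and the class, the topology format and the per-sub-group domain map are invented purely to illustrate the rule.

```python
class SubGroupedGroup:
    """Hypothetical model: one group split into sub-groups with own domains."""

    def __init__(self, subgroups):
        # subgroups: sub-group name -> iterable of device names (assumed format)
        self.subgroups = {name: set(devs) for name, devs in subgroups.items()}
        self.bound = set()
        self.domain = {name: "default" for name in self.subgroups}

    def _subgroup_of(self, dev):
        for name, devs in self.subgroups.items():
            if dev in devs:
                return name
        raise KeyError(dev)

    def bind(self, dev):
        self._subgroup_of(dev)  # reject devices outside this group
        if not self.bound:
            # 1st bind: every sub-group enters a security context.
            for name in self.domain:
                self.domain[name] = "block_dma"
        self.bound.add(dev)

    def attach(self, dev, ioasid):
        if dev not in self.bound:
            raise PermissionError("bind before attach")
        # Each sub-group may use its own IOASID.
        self.domain[self._subgroup_of(dev)] = ioasid

    def unbind(self, dev):
        self.bound.remove(dev)
        name = self._subgroup_of(dev)
        if not (self.subgroups[name] & self.bound):
            self.domain[name] = "block_dma"
        if not self.bound:
            # Last unbind: the whole group returns to the default domain.
            for sg in self.domain:
                self.domain[sg] = "default"

    def in_security_context(self):
        return all(d != "default" for d in self.domain.values())
```

The only rule the model enforces is the one stated above: as long as any device is bound, no sub-group sits in the default domain.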
Thanks
Kevin