RE: Plan for /dev/ioasid RFC v2

From: Tian, Kevin
Date: Mon Jun 28 2021 - 20:43:39 EST


> From: Alex Williamson <alex.williamson@xxxxxxxxxx>
> Sent: Tuesday, June 29, 2021 6:32 AM
>
> On Mon, 28 Jun 2021 01:09:18 +0000
> "Tian, Kevin" <kevin.tian@xxxxxxxxx> wrote:
>
> > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > Sent: Friday, June 25, 2021 10:36 PM
> > >
> > > On Fri, Jun 25, 2021 at 10:27:18AM +0000, Tian, Kevin wrote:
> > >
> > > > - When receiving the binding call for the 1st device in a group,
> iommu_fd
> > > > calls iommu_group_set_block_dma(group, dev->driver) which does
> > > > several things:
> > >
> > > The whole problem here is trying to match this new world where we want
> > > devices to be in charge of their own IOMMU configuration and the old
> > > world where groups are in charge.
> > >
> > > Inserting the group fd and then calling a device-centric
> > > VFIO_GROUP_GET_DEVICE_FD_NEW doesn't solve this conflict, and isn't
> > > necessary.
> >
> > No, this was not what I meant. There is no group fd required when
> > calling this device-centric interface. I was actually talking about:
> >
> > iommu_group_set_block_dma(dev->group, dev->driver)
> >
> > just because current iommu layer API is group-centric. Whether this
> > should be improved could be next-level thing. Sorry for not making
> > it clear in the first place.
> >
> > > We can always get the group back from the device at any
> > > point in the sequence do to a group wide operation.
> >
> > yes.
> >
> > >
> > > What I saw as the appeal of the sort of idea was to just completely
> > > leave all the difficult multi-device-group scenarios behind on the old
> > > group centric API and then we don't have to deal with them at all, or
> > > least not right away.
> >
> > yes, this is the staged approach that we discussed earlier. and
> > the reason why I refined this proposal about multi-devices group
> > here is because you want to see some confidence along this
> > direction. Thus I expanded your idea and hope to achieve consensus
> > with Alex/Joerg who obviously have not been convinced yet.
> >
> > >
> > > I'd see some progression where iommu_fd only works with 1:1 groups at
> > > the start. Other scenarios continue with the old API.
> >
> > One uAPI open after completing this new sketch. v1 proposed to
> > conduct binding (VFIO_BIND_IOMMU_FD) after device_fd is acquired.
> > With this sketch we need a new VFIO_GROUP_GET_DEVICE_FD_NEW
> > to complete both in one step. I want to get Alex's confirmation whether
> > it sounds good to him, since it's better to unify the uAPI between 1:1
> > group and 1:N group even if we don't support 1:N in the start.
>
> I don't like it. It doesn't make sense to me. You have the
> group-centric world, which must continue to exist and cannot change
> because we cannot break the vfio uapi. We can make extensions, we can
> define a new parallel uapi, we can deprecate the uapi, but in the short
> term, it can't change.
>
> AIUI, the new device-centric model starts with vfio device files that
> can be opened directly. So what then is the purpose of a *GROUP* get
> device fd? Why is a vfio uapi involved in setting a device cookie for
> another subsystem?
>
> I'd expect that /dev/iommu will be used by multiple subsystems. All
> will want to bind devices to address spaces, so shouldn't binding a
> device to an iommufd be an ioctl on the iommufd, ie.
> IOMMU_BIND_VFIO_DEVICE_FD. Maybe we don't even need "VFIO" in there
> and
> the iommufd code can figure it out internally.
>
> You're essentially trying to reduce vfio to the device interface. That
> necessarily implies that ioctls on the container, group, or passed
> through the container to the iommu no longer exist. From my
> perspective, there should ideally be no new vfio ioctls. The user gets
> a limited access vfio device fd and enables full access to the device
> by registering it to the iommufd subsystem (100% this needs to be
> enforced until close() to avoid revoke issues). The user interacts
> exclusively with vfio via the device fd and performs all DMA address
> space related operations through the iommufd.
>
> > > Then maybe groups where all devices use the same IOASID.
> > >
> > > Then 1:N groups if the source device is reliably identifiable, this
> > > requires iommu subystem work to attach domains to sub-group objects -
> > > not sure it is worthwhile.
> > >
> > > But at least we can talk about each step with well thought out patches
> > >
> > > The only thing that needs to be done to get the 1:1 step is to broadly
> > > define how the other two cases will work so we don't get into trouble
> > > and set some way to exclude the problematic cases from even getting to
> > > iommu_fd in the first place.
> > >
> > > For instance if we go ahead and create /dev/vfio/device nodes we could
> > > do this only if the group was 1:1, otherwise the group cdev has to be
> > > used, along with its API.
> >
> > I feel for VFIO possibly we don't need significant change to its uAPI
> > sequence, since it anyway needs to support existing semantics for
> > backward compatibility. With this sketch we can keep vfio container/
> > group by introducing an external iommu type which implies a different
> > GET_DEVICE_FD semantics. /dev/iommu can report a fd-wide capability
> > for whether 1:N group is supported to vfio user.
>
> Ideally vfio would also at least be able to register a type1 IOMMU
> backend through the existing uapi, backed by this iommu code, ie. we'd
> create a new "iommufd" (but without the user visible fd), bind all the
> group devices to it, generating our own device cookies, create a single
> ioasid and attach all the devices to it (all internal). When using the
> compatibility mode, userspace doesn't get device cookies, doesn't get
> an iommufd, they do mappings through the container, where vfio owns the
> cookies and ioasid.
>
> > For new subsystems they can directly create device nodes and rely on
> > iommu fd to manage group isolation, without introducing any group
> > semantics in its uAPI.
>
> Create device nodes, bind them to iommufd, associate cookies, attach
> ioasids, etc. That should be the same for all subsystems, including
> vfio, it's just the magic internal handshake between the device
> subsystem and the iommufd subsystem that changes.
>
> > > > a) Check group viability. A group is viable only when all devices in
> > > > the group are in one of below states:
> > > >
> > > > * driver-less
> > > > * bound to a driver which is same as dev->driver (vfio in this
> case)
> > > > * bound to an otherwise allowed driver (same list as in vfio)
> > >
> > > This really shouldn't use hardwired driver checks. Attached drivers
> > > should generically indicate to the iommu layer that they are safe for
> > > iommu_fd usage by calling some function around probe()
> >
> > good idea.
> >
> > >
> > > Thus a group must contain only iommu_fd safe drivers, or drivers-less
> > > devices before any of it can be used. It is the more general
> > > refactoring of what VFIO is doing.
> > >
> > > > c) The iommu layer also verifies group viability on BUS_NOTIFY_
> > > > BOUND_DRIVER event. BUG_ON if viability is broken while
> > > block_dma
> > > > is set.
> > >
> > > And with this concept of iommu_fd safety being first-class maybe we
> > > can somehow eliminate this gross BUG_ON (and the 100's of lines of
> > > code that are used to create it) by denying probe to non-iommu-safe
> > > drivers, somehow.
> >
> > yes.
> >
> > >
> > > > - Binding other devices in the group to iommu_fd just succeeds since
> > > > the group is already in block_dma.
> > >
> > > I think the rest of this more or less describes the device centric
> > > logic for multi-device groups we've already talked about. I don't
> > > think it benifits from having the group fd
> > >
> >
> > sure. All of this new sketch doesn't have group fd in any iommu fd
> > API. Just try to elaborate a full sketch to sync the base.
> >
> > Alex/Joerg, look forward to your thoughts now. 😊
>
> Some provided. Thanks,
>

Thanks a lot Alex! We'll try to focus on the new device-centric flow
w/o touching existing container/group uAPI. As you said, we need
a brand-new mechanism for all subsystems anyway.

With that I will resume v2 progress based on device-centric concept.
It will be still based on a few new VFIO uAPIs to handle device binding/
attaching, though you prefer to not adding any new VFIO uAPI. This is
relatively a smaller open compared to device-centric vs. group-centric
issue. We can have it continuously discussed in parallel with v2 review.
and I hope v2 can be helpful for closing this open with a clearer
explanation about PASID virtualization. 😊

Thanks
Kevin