RE: [RFC 02/20] vfio: Add device class for /dev/vfio/devices

From: Tian, Kevin
Date: Wed Sep 22 2021 - 10:10:14 EST


> From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> Sent: Wednesday, September 22, 2021 8:51 PM
>
> On Wed, Sep 22, 2021 at 03:22:42AM +0000, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Wednesday, September 22, 2021 9:07 AM
> > >
> > > > From: Jason Gunthorpe <jgg@xxxxxxxxxx>
> > > > Sent: Wednesday, September 22, 2021 8:55 AM
> > > >
> > > > On Tue, Sep 21, 2021 at 11:56:06PM +0000, Tian, Kevin wrote:
> > > > > > The opened atomic is awful. A newly created fd should start in a
> > > > > > state where it has a disabled fops.
> > > > > >
> > > > > > The only thing the disabled fops can do is register the device to the
> > > > > > iommu fd. When successfully registered the device gets the normal
> > > > > > fops.
> > > > > >
> > > > > > The registration steps should be done under a normal lock inside the
> > > > > > vfio_device. If a vfio_device is already registered then further
> > > > > > registration should fail.
> > > > > >
> > > > > > Getting the device fd via the group fd triggers the same sequence as
> > > > > > above.
> > > > > >
> > > > >
> > > > > Above works if the group interface is also connected to iommufd, i.e.
> > > > > making vfio type1 as a shim. In this case we can use the registration
> > > > > status as the exclusive switch. But if we keep vfio type1 separate as
> > > > > today, then a new atomic is still necessary. This all depends on how
> > > > > we want to deal with vfio type1 and iommufd, and possibly what's
> > > > > discussed here just adds another pound to the shim option...
> > > >
> > > > No, it works the same either way, the group FD path is identical to
> > > > the normal FD path, it just triggers some of the state transitions
> > > > automatically internally instead of requiring external ioctls.
> > > >
> > > > The device FD starts disabled, an internal API binds it to the iommu
> > > > via open coding with the group API, and then the rest of the APIs can
> > > > be enabled. Same as today.
> > > >
> >
> > After reading your comments on patch08, I may have a clearer picture
> > on your suggestion. The key is to handle exclusive access at the binding
> > time (based on vdev->iommu_dev). Please see whether below makes
> > sense:
> >
> > Shared sequence:
> >
> > 1) initialize the device with a parked fops;
> > 2) need binding (explicit or implicit) to move away from parked fops;
> > 3) switch to normal fops after successful binding;
> >
> > 1) happens at device probe.
>
> 1 happens when the cdev is setup with the parked fops, yes. I'd say it
> happens at fd open time.
>
> > for nongroup 2) and 3) are done together in VFIO_DEVICE_GET_IOMMUFD:
> >
> > - 2) is done by calling .bind_iommufd() callback;
> > - 3) could be done within .bind_iommufd(), or via a new callback e.g.
> > .finalize_device(). The latter may be preferred for the group interface;
> > - Two threads may open the same device simultaneously, with exclusive
> > access guaranteed by iommufd_bind_device();
> > - Open() after successful binding is rejected, since normal fops has been
> > activated. This is checked upon vdev->iommu_dev;
>
> Almost, open is always successful, what fails is
> VFIO_DEVICE_GET_IOMMUFD (or the group equivalent). The user ends up
> with a FD that is useless, cannot reach the ops and thus cannot impact
> the device it doesn't own in any way.

Makes sense. I had the wrong impression that once the normal fops is
activated it is also visible to other threads. But conceptually this fops
replacement should be local to each open fd, so another thread
opening the device always gets a parked fops.
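
To check my understanding, here is a minimal user-space sketch of the
per-fd switch. It is only a model under my own assumptions: every name in
it (model_device, model_fd, model_open, model_bind, parked_io, normal_io)
is hypothetical and not a kernel or iommufd API, bound_owner merely stands
in for vdev->iommu_dev, and a pthread mutex stands in for the lock inside
the vfio_device.

```c
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct model_device {
	pthread_mutex_t lock;     /* models the lock inside the vfio_device */
	const void *bound_owner;  /* models vdev->iommu_dev; NULL means unbound */
};

struct model_fd;
typedef int (*model_op_t)(struct model_fd *fd);

struct model_fd {
	struct model_device *dev;
	model_op_t do_io;         /* per-fd "fops": parked or normal */
};

/* Parked ops: the fd cannot reach the device at all. */
static int parked_io(struct model_fd *fd)
{
	(void)fd;
	return -ENODEV;
}

/* Normal ops: only reachable after a successful bind on this fd. */
static int normal_io(struct model_fd *fd)
{
	(void)fd;
	return 0;
}

/* open() always succeeds and always hands back a parked fd. */
static void model_open(struct model_device *dev, struct model_fd *fd)
{
	fd->dev = dev;
	fd->do_io = parked_io;
}

/*
 * Models the bind step (VFIO_DEVICE_GET_IOMMUFD in the nongroup path).
 * The first binder wins; any later bind on the same device fails and
 * leaves the calling fd parked.
 */
static int model_bind(struct model_fd *fd)
{
	struct model_device *dev = fd->dev;
	int ret = 0;

	pthread_mutex_lock(&dev->lock);
	if (dev->bound_owner) {
		ret = -EBUSY;
	} else {
		dev->bound_owner = fd;
		fd->do_io = normal_io;  /* switch only this fd to normal ops */
	}
	pthread_mutex_unlock(&dev->lock);
	return ret;
}
```

With two fds opened on one device, the first model_bind() returns 0 and the
second returns -EBUSY. The second fd keeps parked_io, so it only ever sees
-ENODEV and cannot touch the device it does not own.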

>
> It is similar to opening a group FD
>
> > for group 2/3) are done together in VFIO_GROUP_GET_DEVICE_FD:
> >
> > - 2) is done by open coding bind_iommufd + attach_ioas. Create an
> > iommufd_device object and record it to vdev->iommu_dev
> > - 3) is done by calling .finalize_device();
> > - open() after a valid vdev->iommu_dev is rejected. this also ensures
> > exclusive ownership with the nongroup path.
>
> Same comment as above, groups should go through the same sequence of
> steps, create a FD, attempt to bind, if successful make the FD
> operational.
>
> The only difference is that failure in these steps does not call
> fd_install(). For this reason alone the FD could start out with
> operational fops, but it feels like a needless optimization.
>
> > If Alex also agrees with it, this might be another mini-series to be merged
> > (just for group path) before this one. Doing so sort of nullifies the existing
> > group/container attaching process, where attach_ioas will be skipped and
> > now the security context is established when the device is opened.
>
> I think it is really important to unify DMA exclusion model and lower
> to the core iommu code. If there is a reason the exclusion must be
> triggered on group fd open then the iommu core code should provide an
> API to do that which interworks with the device API iommufd will work.
>
> But I would start here because it is much simpler to understand..
>

Let's work on this task first and figure out the cleaner way to unify
it. My current impression is that having an iommu API for group fd open
might be simpler. Currently the vfio iommu drivers are coupled to the
container and operate at group granularity. Adapting them to device fd
open will require more changes to handle the device<->group relationship.
Anyway, we'll see...