Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs

From: Jason Gunthorpe
Date: Wed Apr 28 2021 - 11:00:31 EST


On Wed, Apr 28, 2021 at 10:58:29AM +1000, David Gibson wrote:
> On Tue, Apr 27, 2021 at 02:12:12PM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 27, 2021 at 03:08:46PM +1000, David Gibson wrote:
> > > > Starting from a BDF the general pseudo code is:
> > > > device_name = first_directory_of("/sys/bus/pci/devices/BDF/vfio/")
> > > > device_fd = open("/dev/vfio/"+device_name)
> > > > ioasidfd = open("/dev/ioasid")
> > > > ioctl(device_fd, JOIN_IOASID_FD, ioasidfd)
> > >
> > > This line is the problem.
> > >
> > > [Historical aside: Alex's early drafts for the VFIO interface looked
> > > quite similar to this. Ben Herrenschmidt and myself persuaded him it
> > > was a bad idea, and groups were developed instead. I still think it's
> > > a bad idea, and not just for POWER]
> >
> > Spawning the VFIO device FD from the group FD is incredibly gross from
> > a kernel design perspective. Since that was done the struct
> > vfio_device missed out on a sysfs presence and doesn't have the
> > typical 'struct device' member or dedicated char device you'd expect a
> > FD based subsystem to have.
> >
> > This basically traded normal usage of the driver core for something
> > that doesn't serve a technical usage. Given we are now nearly 10 years
> > later and see that real widely deployed applications are not doing
> > anything special with the group FD it makes me question the wisdom of
> > this choice.
>
> I'm really not sure what "anything special" would constitute here.

Well, really anything actually. All I see in, say, dpdk, is open the
group fd, get a device fd, do the container dance and never touch the
group fd again or care about groups in any way. It seems typical of
this class of application.

If dpdk is exposing other devices to a risk it certainly hasn't done
anything to make that obvious.

> > Okay, that is fair, but let's solve that problem directly. For
> > instance netlink has been going in the direction of adding a "extack"
> > from the kernel which is a descriptive error string. If the failing
> > ioctl returned the string:
> >
> > "cannot join this device to the IOASID because device XXX in the
> > same group #10 is in use"
>
> Um.. is there a sane way to return strings from an ioctl()?

Yes, it can be done, a string buffer pointer and length in the input
for instance.

> Getting the device fds from the group fd kind of follows, because it's
> unsafe to do basically anything on the device unless you already
> control the group (which in this case means attaching it to a
> container/ioasid). I'm entirely open to ways of doing that that are
> less inelegant from a sysfs integration point of view, but the point
> is you must manage the group before you can do anything at all with
> individual devices.

I think most likely VFIO is going to be the only thing to manage a
multi-device group.

I see things like VDPA being primarily about PASID, and an IOASID that
is matched to a PASID is inherently a single device IOMMU group.

> I don't see why. I mean, sure, you don't want explicitly the *vfio*
> group as such. But IOMMU group is already a cross-subsystem concept
> and you can explicitly expose that in a different way.

Yes, and no, the kernel drivers in something like VDPA have decided
what device and group they are in before we get to IOASID. It is
illogical to try to retro-actively bolt in a group concept to their
APIs.

> Again, I realy think this is necessary complexity. You're right that
> far too little of the userspace properly understands group
> restrictions.. but these come from real hardware limitations, and I
> don't feel like making it *less* obvious in the interface is going to
> help that.

The appeal of making it less obvious is we can have a single
simplified API flow so that an application that doesn't understand or
care about groups can have uniformity.

Jason