Re: New subsystem for acceleration devices
From: Jason Gunthorpe
Date: Mon Aug 08 2022 - 13:46:51 EST
On Sun, Aug 07, 2022 at 09:43:40AM +0300, Oded Gabbay wrote:
> 1. If there is a subsystem which is responsible for creating and
> exposing the device character files, then there should be some code
> that connects between each device driver to that subsystem.
> i.e. There should be functions that each driver should call from its
> probe and release callback functions.
>
> Those functions should take care of the following:
> - Create metadata for the device, the device's minor(s) and the
> driver's ioctls table and driver's callback for file operations (both
> are common for all the driver's devices). Save all that metadata with
> proper locking.
> - Create the device char files themselves and supply file operations
> that will be called per each open/close/mmap/etc.
> - Keep track of all these objects' lifetime in regard to the device
> driver's lifetime, with proper handling for release.
> - Add common handling and entries of sysfs/debugfs for these devices
> with the ability for each device driver to add their own unique
> entries.
>
> 2. I think that you underestimate (due to your experience) the "using
> it properly" part... It is not so easy to do this properly for
> inexperienced kernel people. If we provide all the code I mentioned
> above, the device driver writer doesn't need to be aware of all these
> kernel APIs.
This may be, but it still seems weird to me to justify a subsystem as
"making existing APIs simpler so drivers don't mess them up". It
suggests perhaps we need additional core API helpers?
> > It would be nice to at least identify something that could obviously
> > be common, like some kind of enumeration and metadata kind of stuff
> > (think like ethtool, devlink, rdma tool, nvemctl etc)
> Definitely. I think we can have at least one ioctl that will be common
> to all drivers from the start.
Generally you don't want that as an ioctl because you have to open the
device to execute it, there may be permissions issues for instance -
or if you have a single-open-at-a-time model like VFIO, then it
doesn't work well together.
Usually this would be sysfs or netlink.
> > This makes sense to me, all accelerators need a way to set a memory
> > map, but on the other hand we are doing some very nifty stuff in this
> > area with iommufd and it might be very interesting to just have the
> > accelerator driver link to that API instead of building yet another
> > copy of pin_user_pages() code.. Especially with PASID likely becoming
> > part of any accelerator toolkit.
>
> Here I disagree with you. First of all, there are many relatively
> simple accelerators, especially in edge, where PASID is really not
> relevant.
> Second, even for the more sophisticated PCIe/CXL-based ones, PASID is
> not mandatory and I suspect that it won't be in 100% of those devices.
> But definitely that should be an alternative to the "classic" way of
> handling dma'able memory (pin_user_pages()).
My point was that iommufd can do the pinning for you and dump that
result into a iommu based PASID, or it can do the pinning for you and
allow the driver to translate it into its own page table format eg the
ASID in the habana device.
We don't need to have map/unmap APIs to manage address spaces in every
subsystem.
> Maybe this is something that should be discussed in the kernel summit ?
Maybe, I expect to be at LPC at least
Jason