Re: [RFC PATCH 0/8] Qualcomm Cloud AI 100 driver

From: Jason Gunthorpe
Date: Tue May 19 2020 - 19:26:48 EST


On Tue, May 19, 2020 at 10:41:15PM +0200, Daniel Vetter wrote:

> Get some consistency into your decision making as maintainer. And don't
> tell me or anyone else that this is complicated, gpu and rdma driver folks
> very much told you and Olof last year that this is what you're getting
> yourself into.

It is complicated!

One of the big mistakes we learned from in RDMA is that we must have a
canonical open userspace, that is at least the user side of the uABI
from the kernel. It doesn't have to do a lot but it does have to be
there and everyone must use it.

Some time ago it was all a fragmented mess where every HW had its own
library project with no community and that spilled into the kernel
where it became impossible to be sure everyone was playing nicely and
keeping their parts up to date. We are still digging out: I keep
finding stuff in the kernel that never seems to have made it into any
userspace.

I feel this is an essential ingredient, and I think I gave this advice
at LPC as well - it is important to start as a proper subsystem with a
proper standard user space. IMHO a random collection of opaque misc
drivers for incredibly complex HW is not going to magically gel into a
subsystem.

Given the state of the industry the userspace doesn't have to do
a lot, and maybe that library exposes unique APIs for each HW, but it
is at least a rallying point to handle all these questions like: 'is
the proposed userspace enough?', give some consistency, and be ready
to add in those things that are common (like, say, IOMMU PASID setup).

The uacce stuff is sort of interesting here as it does seem to take
some of that approach. It is really simplistic, but the basic idea of
creating a generic DMA work ring is in there, and probably applies
just as well to several of these 'totally-not-a-GPU' drivers.
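
To make the 'generic DMA work ring' idea concrete, here is a minimal
sketch in C. It is not the uacce interface: every name in it
(work_desc, work_ring, ring_post, ring_consume) is hypothetical, and it
assumes a single producer and a single consumer. It also leaves out
the memory barriers and doorbell write a real device would need.

```c
#include <stdbool.h>
#include <stdint.h>

#define RING_SLOTS 8u /* hypothetical size; must be a power of two */

/* Hypothetical work descriptor: one DMA job described by userspace. */
struct work_desc {
    uint64_t addr;   /* DMA address of the buffer */
    uint32_t len;    /* buffer length in bytes */
    uint32_t opcode; /* device-specific operation, opaque to the ring */
};

/* Free-running indices: head counts posts, tail counts completions. */
struct work_ring {
    struct work_desc slot[RING_SLOTS];
    uint32_t head; /* next slot the producer (userspace) fills */
    uint32_t tail; /* next slot the consumer (device) reads */
};

/* Post one descriptor; returns false when the ring is full. */
static bool ring_post(struct work_ring *r, struct work_desc d)
{
    if (r->head - r->tail == RING_SLOTS)
        return false;
    r->slot[r->head % RING_SLOTS] = d;
    r->head++; /* real hardware: barrier, then ring the doorbell */
    return true;
}

/* Take the oldest descriptor; returns false when the ring is empty. */
static bool ring_consume(struct work_ring *r, struct work_desc *out)
{
    if (r->head == r->tail)
        return false;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return true;
}
```

The ring itself never interprets opcode, which is the point: the
queueing part can be common while the work it carries stays HW
specific.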

The other key is that the uABI from the kernel does need to be very
flexible as really any new HW can appear with any new strange need all
the time, and there will not be detailed commonality between HWs. RDMA
has made this mistake a lot in the past too.

The newer RDMA netlink-like API is actually turning out not bad for
this purpose. (again something a subsystem could provide)
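
As a rough illustration of why a netlink-like layout gives that
flexibility, here is a minimal type-length-value sketch in C. It is
not the RDMA uAPI and not the real netlink wire format: attr_hdr,
attr_put, attr_find and the 4-byte padding are all assumptions made
for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical attribute header, followed by len bytes of payload. */
struct attr_hdr {
    uint16_t type; /* attribute ID; new HW just defines new IDs */
    uint16_t len;  /* payload length in bytes, header excluded */
};

#define ATTR_ALIGN 4u

/* Round a payload length up to the next multiple of ATTR_ALIGN. */
static size_t attr_pad(size_t n)
{
    return (n + ATTR_ALIGN - 1) & ~(size_t)(ATTR_ALIGN - 1);
}

/* Append one attribute at off; returns the new offset, 0 on overflow. */
static size_t attr_put(uint8_t *buf, size_t cap, size_t off,
                       uint16_t type, const void *data, uint16_t len)
{
    size_t need = sizeof(struct attr_hdr) + attr_pad(len);
    struct attr_hdr h = { type, len };

    if (off > cap || need > cap - off)
        return 0;
    memcpy(buf + off, &h, sizeof(h));
    if (len)
        memcpy(buf + off + sizeof(h), data, len);
    memset(buf + off + sizeof(h) + len, 0, attr_pad(len) - len);
    return off + need;
}

/* Find the first attribute of the given type, skipping all others.
 * Returns a pointer to its payload, or NULL if absent or truncated. */
static const void *attr_find(const uint8_t *buf, size_t used,
                             uint16_t type, uint16_t *len_out)
{
    size_t off = 0;

    while (used - off >= sizeof(struct attr_hdr)) {
        struct attr_hdr h;
        size_t step;

        memcpy(&h, buf + off, sizeof(h));
        step = sizeof(h) + attr_pad(h.len);
        if (step > used - off)
            return NULL; /* truncated message */
        if (h.type == type) {
            *len_out = h.len;
            return buf + off + sizeof(h);
        }
        off += step; /* unknown or unwanted type: step over it */
    }
    return NULL;
}
```

Because the parser steps over types it does not recognise, a driver
can add a strange new attribute for strange new HW without changing
the layout everyone else already depends on.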

Also the approach in this driver to directly connect the device to
userspace for control commands has worked for RDMA in the past few
years.

Jason