Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver

From: Oded Gabbay
Date: Fri Sep 18 2020 - 10:12:39 EST

On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote:
> > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote:
> > >
> > > > The problem with MR is that the API doesn't let us return a new VA. It
> > > > forces us to use the original VA that the Host OS allocated.
> > >
> > > If using the common MR API you'd have to assign a unique linear range
> > > in the single device address map and record both the IOVA and the MMU
> > > VA in the kernel struct.
> > >
> > > Then when submitting work using that MR lkey the kernel will adjust
> > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before
> > > forwarding to HW.
> > >
> > We can't do that. That will kill the performance. If for every
> > submission I need to modify the packet's contents, the throughput will
> > go downhill.
> You clearly didn't read where I explained there is a fast path and
> slow path expectation.
> > Also, submissions to our RDMA qmans are coupled with submissions to
> > our DMA/Compute QMANs. We can't separate those to different API calls.
> > That will also kill performance and in addition, will prevent us from
> > synchronizing all the engines.
> Not sure I see why this is a problem. I already explained the fast
> device specific path.
> As long as the kernel maintains proper security when it processes
> submissions the driver can allow objects to cross between the two
> domains.
Can you please explain what you mean by "two domains" ?
You mean the RDMA and compute domains ? Or something else ?

What I was trying to say is that I don't want the application to split
its submissions to different system calls.

Currently we perform submissions through the CS_IOCTL that is defined
in our driver. It is a single IOCTL which allows the user to submit
work to all queues, without regard to the underlying engine of each
If I need to split that to different system calls it will have major
implications. I don't even want to start thinking about all the
synchronization at the host (userspace) level that I will need to do.
That's what I meant by saying that you force me to treat my device as
if it were multiple devices. The whole point of our ASIC is to combine
multiple IPs on the same ASIC.

What will happen when we will add a third domain to our device (e.g.
storage, video decoding, encryption engine, whatever). Will I then
need to separate submissions to 3 different system calls ? In 3
different subsystems ? This doesn't scale. And I strongly say that it
will kill the performance of the device. Not because of the driver.
Because of the complications to the user-space.


> Jason