Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver

From: Oded Gabbay
Date: Fri Sep 18 2020 - 10:45:54 EST

On Fri, Sep 18, 2020 at 5:19 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> On Fri, Sep 18, 2020 at 05:12:04PM +0300, Oded Gabbay wrote:
> > On Fri, Sep 18, 2020 at 4:59 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > >
> > > On Fri, Sep 18, 2020 at 04:49:25PM +0300, Oded Gabbay wrote:
> > > > On Fri, Sep 18, 2020 at 4:26 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> > > > >
> > > > > On Fri, Sep 18, 2020 at 04:02:24PM +0300, Oded Gabbay wrote:
> > > > >
> > > > > > The problem with MR is that the API doesn't let us return a new VA. It
> > > > > > forces us to use the original VA that the Host OS allocated.
> > > > >
> > > > > If using the common MR API you'd have to assign a unique linear range
> > > > > in the single device address map and record both the IOVA and the MMU
> > > > > VA in the kernel struct.
> > > > >
> > > > > Then when submitting work using that MR lkey the kernel will adjust
> > > > > the work VA using the equation (WORK_VA - IOVA) + MMU_VA before
> > > > > forwarding to HW.
> > > > >
> > > > We can't do that. That will kill the performance. If for every
> > > > submission I need to modify the packet's contents, the throughput will
> > > > go downhill.
> > >
> > > You clearly didn't read where I explained there is a fast path and
> > > slow path expectation.
> > >
> > > > Also, submissions to our RDMA qmans are coupled with submissions to
> > > > our DMA/Compute QMANs. We can't separate those to different API calls.
> > > > That will also kill performance and in addition, will prevent us from
> > > > synchronizing all the engines.
> > >
> > > Not sure I see why this is a problem. I already explained the fast
> > > device specific path.
> > >
> > > As long as the kernel maintains proper security when it processes
> > > submissions the driver can allow objects to cross between the two
> > > domains.
> > Can you please explain what you mean by "two domains" ?
> > You mean the RDMA and compute domains ? Or something else ?
> Yes
> > What I was trying to say is that I don't want the application to split
> > its submissions to different system calls.
> If you can manage the security then you can cross them. Eg since The
> RDMA PD would be created on top of the /dev/misc char dev then it is
> fine for the /dev/misc char dev to access the RDMA objects as a 'dv
> fast path'.
> But now that you say everything is interconnected, I'm wondering,
> without HW security how do you keep netdev isolated from userspace?
> Can I issue commands to /dev/misc and write to kernel memory (does the
> kernel put any pages into the single MMU?) or corrupt the netdev
> driver operations in any way?
> Jason

No, no, no. Please give me more credit :) btw, our kernel interface
was scrutinized when we upstreamed the driver and it was under review
by the Intel security team.

To explain our security mechanism will require some time. It is
detailed in the driver, but it is hard to understand without some
background. I wonder where to start...

First of all, we support open, close, mmap and IOCTLs to
/dev/misc/hlX. We don't support read/write system calls.
A user never gets direct access to kernel memory, only access through
standard mmap. The only things we allow to mmap are a command buffer
(which is used to submit work to certain DMA queues on our device)
and a memory region we use as the "CQ" for RDMA. That's it.

Any access by the device's engines to the host memory is done via our
device's MMU. Our MMU supports multiple ASIDs - Address Space IDs. The
kernel driver is assigned ASID 0, while the user is assigned ASID 1.
We can support up to 1024 ASIDs, but because we limit the user to have
a single application, we only use ASID 0 and 1.

The above means a user can't program an engine (DMA, NIC, compute) to
access memory he didn't first map into our device's MMU. The
mapping is done via one of our IOCTLs and the kernel driver makes sure
(using standard kernel internal APIs) the host memory truly belongs to
the user process. All those mappings are done using ASID 1.

If the driver needs to map kernel pages into the device's MMU, then
this is done using ASID 0. This is how we take care of separation
between kernel memory and user memory.

Each transaction that our engines create and that goes to the host
first passes through our MMU. The transaction comes with its ASID
value. According to that, the MMU knows which page tables to walk.
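
As a rough illustration of the ASID separation described above (a toy
model, not the actual habanalabs code or the real page-table format),
the sketch below keeps one flat translation table per ASID and looks
up every transaction only in the table selected by its ASID. The names
toy_mmu, toy_mmu_map and toy_mmu_translate, the 4 KiB page size and the
fixed-size flat table are all assumptions made for the example. Only
the ASID numbering (0 for the kernel driver, 1 for the user) comes from
the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* ASID numbering follows the email; everything else here is hypothetical. */
#define ASID_KERNEL  0u
#define ASID_USER    1u
#define NUM_ASIDS    2u   /* the H/W supports up to 1024, only two are used */
#define PAGE_SHIFT   12u  /* assumed 4 KiB pages */
#define MAX_MAPPINGS 8u   /* assumed tiny flat table instead of real page tables */

struct mapping {
	uint64_t dev_vpn;   /* device virtual page number */
	uint64_t host_pfn;  /* host physical frame number */
	bool valid;
};

struct toy_mmu {
	struct mapping table[NUM_ASIDS][MAX_MAPPINGS];
};

/* Install a translation in the table of one ASID; false if the table is full. */
bool toy_mmu_map(struct toy_mmu *mmu, unsigned int asid,
		 uint64_t dev_va, uint64_t host_pa)
{
	assert(asid < NUM_ASIDS);
	for (size_t i = 0; i < MAX_MAPPINGS; i++) {
		struct mapping *m = &mmu->table[asid][i];

		if (!m->valid) {
			m->dev_vpn = dev_va >> PAGE_SHIFT;
			m->host_pfn = host_pa >> PAGE_SHIFT;
			m->valid = true;
			return true;
		}
	}
	return false;
}

/*
 * Translate a device VA for a transaction tagged with 'asid'.
 * Only the table of that ASID is searched, so a user transaction
 * (ASID 1) can never resolve through a kernel mapping (ASID 0).
 * Returns false (a fault) when the page is not mapped in that ASID.
 */
bool toy_mmu_translate(const struct toy_mmu *mmu, unsigned int asid,
		       uint64_t dev_va, uint64_t *host_pa)
{
	uint64_t vpn = dev_va >> PAGE_SHIFT;
	uint64_t offset = dev_va & ((1ull << PAGE_SHIFT) - 1);

	assert(asid < NUM_ASIDS);
	for (size_t i = 0; i < MAX_MAPPINGS; i++) {
		const struct mapping *m = &mmu->table[asid][i];

		if (m->valid && m->dev_vpn == vpn) {
			*host_pa = (m->host_pfn << PAGE_SHIFT) | offset;
			return true;
		}
	}
	return false;
}
```

In this model the same device VA can be mapped under both ASIDs and
resolve to different host pages, and an address that was never mapped
under ASID 1 simply faults for a user transaction.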

Specifically regarding RDMA, the user prepares a WQE on the host
memory in an area which is mapped into our MMU using ASID 1. The user
uses the NIC control IOCTL to give the kernel driver the virtual base
address of the WQ and the driver programs it to the H/W. Then, the
user can submit the WQE by submitting a command buffer to the NIC
QMAN. The command buffer contains a message to the QMAN that tells it
to ring the doorbell of the relevant NIC port. The user can't do it
from userspace.

For regular Ethernet traffic, we don't have any IOCTLs of course. All
Ethernet operations are done via the standard networking subsystem
(sockets, etc.).

There are more details of course. I don't know how much you want me to
go deeper. If you have specific questions I'll be happy to answer.