Re: [PATCH v3 00/14] Adding GAUDI NIC code to habanalabs driver
From: Oded Gabbay
Date: Fri Sep 18 2020 - 09:02:58 EST
On Fri, Sep 18, 2020 at 3:50 PM Jason Gunthorpe <jgg@xxxxxxxx> wrote:
> On Fri, Sep 18, 2020 at 03:34:54PM +0300, Oded Gabbay wrote:
> > > > Another example is that the submission of WQ is done through our QMAN
> > > > mechanism and is NOT mapped to userspace (due to the restrictions you
> > > > mentioned above and other restrictions).
> > >
> > > Sure, other RDMA drivers also require a kernel ioctl for command
> > > execution.
> > >
> > > In this model the MR can be a software construct, again representing a
> > > security authorization:
> > >
> > > - A 'full process' MR, in which case the kernel command excution
> > > handles dma map and pinning at command execution time
> > > - A 'normal' MR, in which case the DMA list is pre-created and the
> > > command execution just re-uses this data
> > >
> > > The general requirement for RDMA is the same as DRM, you must provide
> > > enough code in rdma-core to show how the device works, and minimally
> > > test it. EFA uses ibv_ud_pingpong, and some pyverbs tests IIRC.
> > >
> > > So you'll want to arrange something where the default MR and PD
> > > mechanisms do something workable on this device, like auto-open the
> > > misc FD when building the PD, and support the 'normal' MR flow for
> > > command execution.
> > I don't know how we can support MR because we can't support any
> > virtual address on the host. Our internal MMU doesn't support 64-bits.
> > We investigated in the past, very much wanted to use IBverbs but
> > didn't figure out how to make it work.
> > I'm adding Itay here and he can also shed more details on that.
> I'm not sure what that means, if the driver intends to DMA from
> process memory then it certainly has a MR concept.
> MRs can control the IOVA directly so if you say the HW needs a MR IOVA
> < 2**32 then that is still OK.
I'll try to explain but please bear with me because it requires some
understanding of our H/W architecture.
Our ASIC has 32 GB of HBM memory (similar to GPUs). The problem is
that HBM memory is accessed by our ASIC's engines (DMA, NIC, etc.)
with physical addressing, which is mapped inside our device between
0x0 to 0x8_0000_0000.
Now, if a user performs malloc and then maps that memory to our device
(using our memory MAP ioctl, similar to how GPU works), it will get a
new virtual address, which is in the range of 0x80_0000_0000 - (2^50
-1). Then, he can use that new VA in our device with different engines
(DMA, NIC, compute).
That way, addresses that represent the host memory do not overlap
addresses that represent HBM memory.
The problem with MR is that the API doesn't let us return a new VA. It
forces us to use the original VA that the Host OS allocated. What will
we do if that VA is in the range of our HBM addresses ? The device
won't be able to distinguish between them. The transaction that is
generated by an engine inside our device will go to the HBM instead of
going to the PCI controller and then to the host.
That's the crust of the problem and why we didn't use MR.
If that's not clear, I'll be happy to explain more.