Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

From: Kenneth Lee
Date: Thu Nov 15 2018 - 03:49:46 EST


On Wed, Nov 14, 2018 at 06:00:17PM +0200, Leon Romanovsky wrote:
> Date: Wed, 14 Nov 2018 18:00:17 +0200
> From: Leon Romanovsky <leon@xxxxxxxxxx>
> To: Kenneth Lee <nek.in.cn@xxxxxxxxx>
> CC: Tim Sell <timothy.sell@xxxxxxxxxx>, linux-doc@xxxxxxxxxxxxxxx,
> Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>, Zaibo Xu
> <xuzaibo@xxxxxxxxxx>, zhangfei.gao@xxxxxxxxxxx, linuxarm@xxxxxxxxxx,
> haojian.zhuang@xxxxxxxxxx, Christoph Lameter <cl@xxxxxxxxx>, Hao Fang
> <fanghao11@xxxxxxxxxx>, Gavin Schenk <g.schenk@xxxxxxxxxxxx>, RDMA mailing
> list <linux-rdma@xxxxxxxxxxxxxxx>, Zhou Wang <wangzhou1@xxxxxxxxxxxxx>,
> Jason Gunthorpe <jgg@xxxxxxxx>, Doug Ledford <dledford@xxxxxxxxxx>, Uwe
> Kleine-KÃnig <u.kleine-koenig@xxxxxxxxxxxxxx>, David Kershner
> <david.kershner@xxxxxxxxxx>, Johan Hovold <johan@xxxxxxxxxx>, Cyrille
> Pitchen <cyrille.pitchen@xxxxxxxxxxxxxxxxxx>, Sagar Dharia
> <sdharia@xxxxxxxxxxxxxx>, Jens Axboe <axboe@xxxxxxxxx>,
> guodong.xu@xxxxxxxxxx, linux-netdev <netdev@xxxxxxxxxxxxxxx>, Randy Dunlap
> <rdunlap@xxxxxxxxxxxxx>, linux-kernel@xxxxxxxxxxxxxxx, Vinod Koul
> <vkoul@xxxxxxxxxx>, linux-crypto@xxxxxxxxxxxxxxx, Philippe Ombredanne
> <pombredanne@xxxxxxxx>, Sanyog Kale <sanyog.r.kale@xxxxxxxxx>, Kenneth Lee
> <liguozhu@xxxxxxxxxxxxx>, "David S. Miller" <davem@xxxxxxxxxxxxx>,
> linux-accelerators@xxxxxxxxxxxxxxxx
> Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
> User-Agent: Mutt/1.10.1 (2018-07-13)
> Message-ID: <20181114160017.GI3759@xxxxxxxxxxxxxxxxxx>
>
> On Wed, Nov 14, 2018 at 10:58:09AM +0800, Kenneth Lee wrote:
> >
> > å 2018/11/13 äå8:23, Leon Romanovsky åé:
> > > On Mon, Nov 12, 2018 at 03:58:02PM +0800, Kenneth Lee wrote:
> > > > From: Kenneth Lee <liguozhu@xxxxxxxxxxxxx>
> > > >
> > > > WarpDrive is a general accelerator framework for the user application to
> > > > access the hardware without going through the kernel in data path.
> > > >
> > > > The kernel component to provide kernel facility to driver for expose the
> > > > user interface is called uacce. It a short name for
> > > > "Unified/User-space-access-intended Accelerator Framework".
> > > >
> > > > This patch add document to explain how it works.
> > > + RDMA and netdev folks
> > >
> > > Sorry, to be late in the game, I don't see other patches, but from
> > > the description below it seems like you are reinventing RDMA verbs
> > > model. I have hard time to see the differences in the proposed
> > > framework to already implemented in drivers/infiniband/* for the kernel
> > > space and for the https://github.com/linux-rdma/rdma-core/ for the user
> > > space parts.
> >
> > Thanks Leon,
> >
> > Yes, we tried to solve similar problem in RDMA. We also learned a lot from
> > the exist code of RDMA. But we we have to make a new one because we cannot
> > register accelerators such as AI operation, encryption or compression to the
> > RDMA framework:)
>
> Assuming that you did everything right and still failed to use RDMA
> framework, you was supposed to fix it and not to reinvent new exactly
> same one. It is how we develop kernel, by reusing existing code.

Yes, but we don't force other system such as NIC or GPU into RDMA, do we?

I assume you would not agree to register a zip accelerator to infiniband? :)

Further, I don't think it is wise to break an exist system (RDMA) to fulfill a
totally new scenario. The better choice is to let them run in parallel for some
time and try to merge them accordingly.

>
> >
> > Another problem we tried to address is the way to pin the memory for dma
> > operation. The RDMA way to pin the memory cannot avoid the page lost due to
> > copy-on-write operation during the memory is used by the device. This may
> > not be important to RDMA library. But it is important to accelerator.
>
> Such support exists in drivers/infiniband/ from late 2014 and
> it is called ODP (on demand paging).

I reviewed ODP and I think it is a solution bound to infiniband. It is part of
MR semantics and required a infiniband specific hook
(ucontext->invalidate_range()). And the hook requires the device to be able to
stop using the page for a while for the copying. It is ok for infiniband
(actually, only mlx5 uses it). I don't think most accelerators can support
this mode. But WarpDrive works fully on top of IOMMU interface, it has no this
limitation.

>
> >
> > Hope this can help the understanding.
>
> Yes, it helped me a lot.
> Now, I'm more than before convinced that this whole patchset shouldn't
> exist in the first place.

Then maybe you can tell me how I can register my accelerator to the user space?

>
> To be clear, NAK.
>
> Thanks
>
> >
> > Cheers
> >
> > >
> > > Hard NAK from RDMA side.
> > >
> > > Thanks
> > >
> > > > Signed-off-by: Kenneth Lee <liguozhu@xxxxxxxxxxxxx>
> > > > ---
> > > > Documentation/warpdrive/warpdrive.rst | 260 +++++++
> > > > Documentation/warpdrive/wd-arch.svg | 764 ++++++++++++++++++++
> > > > Documentation/warpdrive/wd.svg | 526 ++++++++++++++
> > > > Documentation/warpdrive/wd_q_addr_space.svg | 359 +++++++++
> > > > 4 files changed, 1909 insertions(+)
> > > > create mode 100644 Documentation/warpdrive/warpdrive.rst
> > > > create mode 100644 Documentation/warpdrive/wd-arch.svg
> > > > create mode 100644 Documentation/warpdrive/wd.svg
> > > > create mode 100644 Documentation/warpdrive/wd_q_addr_space.svg
> > > >
> > > > diff --git a/Documentation/warpdrive/warpdrive.rst b/Documentation/warpdrive/warpdrive.rst
> > > > new file mode 100644
> > > > index 000000000000..ef84d3a2d462
> > > > --- /dev/null
> > > > +++ b/Documentation/warpdrive/warpdrive.rst
> > > > @@ -0,0 +1,260 @@
> > > > +Introduction of WarpDrive
> > > > +=========================
> > > > +
> > > > +*WarpDrive* is a general accelerator framework for the user application to
> > > > +access the hardware without going through the kernel in data path.
> > > > +
> > > > +It can be used as the quick channel for accelerators, network adaptors or
> > > > +other hardware for application in user space.
> > > > +
> > > > +This may make some implementation simpler. E.g. you can reuse most of the
> > > > +*netdev* driver in kernel and just share some ring buffer to the user space
> > > > +driver for *DPDK* [4] or *ODP* [5]. Or you can combine the RSA accelerator with
> > > > +the *netdev* in the user space as a https reversed proxy, etc.
> > > > +
> > > > +*WarpDrive* takes the hardware accelerator as a heterogeneous processor which
> > > > +can share particular load from the CPU:
> > > > +
> > > > +.. image:: wd.svg
> > > > + :alt: WarpDrive Concept
> > > > +
> > > > +The virtual concept, queue, is used to manage the requests sent to the
> > > > +accelerator. The application send requests to the queue by writing to some
> > > > +particular address, while the hardware takes the requests directly from the
> > > > +address and send feedback accordingly.
> > > > +
> > > > +The format of the queue may differ from hardware to hardware. But the
> > > > +application need not to make any system call for the communication.
> > > > +
> > > > +*WarpDrive* tries to create a shared virtual address space for all involved
> > > > +accelerators. Within this space, the requests sent to queue can refer to any
> > > > +virtual address, which will be valid to the application and all involved
> > > > +accelerators.
> > > > +
> > > > +The name *WarpDrive* is simply a cool and general name meaning the framework
> > > > +makes the application faster. It includes general user library, kernel
> > > > +management module and drivers for the hardware. In kernel, the management
> > > > +module is called *uacce*, meaning "Unified/User-space-access-intended
> > > > +Accelerator Framework".
> > > > +
> > > > +
> > > > +How does it work
> > > > +================
> > > > +
> > > > +*WarpDrive* uses *mmap* and *IOMMU* to play the trick.
> > > > +
> > > > +*Uacce* creates a chrdev for the device registered to it. A "queue" will be
> > > > +created when the chrdev is opened. The application access the queue by mmap
> > > > +different address region of the queue file.
> > > > +
> > > > +The following figure demonstrated the queue file address space:
> > > > +
> > > > +.. image:: wd_q_addr_space.svg
> > > > + :alt: WarpDrive Queue Address Space
> > > > +
> > > > +The first region of the space, device region, is used for the application to
> > > > +write request or read answer to or from the hardware.
> > > > +
> > > > +Normally, there can be three types of device regions mmio and memory regions.
> > > > +It is recommended to use common memory for request/answer descriptors and use
> > > > +the mmio space for device notification, such as doorbell. But of course, this
> > > > +is all up to the interface designer.
> > > > +
> > > > +There can be two types of device memory regions, kernel-only and user-shared.
> > > > +This will be explained in the "kernel APIs" section.
> > > > +
> > > > +The Static Share Virtual Memory region is necessary only when the device IOMMU
> > > > +does not support "Share Virtual Memory". This will be explained after the
> > > > +*IOMMU* idea.
> > > > +
> > > > +
> > > > +Architecture
> > > > +------------
> > > > +
> > > > +The full *WarpDrive* architecture is represented in the following class
> > > > +diagram:
> > > > +
> > > > +.. image:: wd-arch.svg
> > > > + :alt: WarpDrive Architecture
> > > > +
> > > > +
> > > > +The user API
> > > > +------------
> > > > +
> > > > +We adopt a polling style interface in the user space: ::
> > > > +
> > > > + int wd_request_queue(struct wd_queue *q);
> > > > + void wd_release_queue(struct wd_queue *q);
> > > > +
> > > > + int wd_send(struct wd_queue *q, void *req);
> > > > + int wd_recv(struct wd_queue *q, void **req);
> > > > + int wd_recv_sync(struct wd_queue *q, void **req);
> > > > + void wd_flush(struct wd_queue *q);
> > > > +
> > > > +wd_recv_sync() is a wrapper to its non-sync version. It will trapped into
> > > > +kernel and waits until the queue become available.
> > > > +
> > > > +If the queue do not support SVA/SVM. The following helper function
> > > > +can be used to create Static Virtual Share Memory: ::
> > > > +
> > > > + void *wd_preserve_share_memory(struct wd_queue *q, size_t size);
> > > > +
> > > > +The user API is not mandatory. It is simply a suggestion and hint what the
> > > > +kernel interface is supposed to support.
> > > > +
> > > > +
> > > > +The user driver
> > > > +---------------
> > > > +
> > > > +The queue file mmap space will need a user driver to wrap the communication
> > > > +protocol. *UACCE* provides some attributes in sysfs for the user driver to
> > > > +match the right accelerator accordingly.
> > > > +
> > > > +The *UACCE* device attribute is under the following directory:
> > > > +
> > > > +/sys/class/uacce/<dev-name>/params
> > > > +
> > > > +The following attributes is supported:
> > > > +
> > > > +nr_queue_remained (ro)
> > > > + number of queue remained
> > > > +
> > > > +api_version (ro)
> > > > + a string to identify the queue mmap space format and its version
> > > > +
> > > > +device_attr (ro)
> > > > + attributes of the device, see UACCE_DEV_xxx flag defined in uacce.h
> > > > +
> > > > +numa_node (ro)
> > > > + id of numa node
> > > > +
> > > > +priority (rw)
> > > > + Priority or the device, bigger is higher
> > > > +
> > > > +(This is not yet implemented in RFC version)
> > > > +
> > > > +
> > > > +The kernel API
> > > > +--------------
> > > > +
> > > > +The *uacce* kernel API is defined in uacce.h. If the hardware support SVM/SVA,
> > > > +The driver need only the following API functions: ::
> > > > +
> > > > + int uacce_register(uacce);
> > > > + void uacce_unregister(uacce);
> > > > + void uacce_wake_up(q);
> > > > +
> > > > +*uacce_wake_up* is used to notify the process who epoll() on the queue file.
> > > > +
> > > > +According to the IOMMU capability, *uacce* categories the devices as follow:
> > > > +
> > > > +UACCE_DEV_NOIOMMU
> > > > + The device has no IOMMU. The user process cannot use VA on the hardware
> > > > + This mode is not recommended.
> > > > +
> > > > +UACCE_DEV_SVA (UACCE_DEV_PASID | UACCE_DEV_FAULT_FROM_DEV)
> > > > + The device has IOMMU which can share the same page table with user
> > > > + process
> > > > +
> > > > +UACCE_DEV_SHARE_DOMAIN
> > > > + The device has IOMMU which has no multiple page table and device page
> > > > + fault support
> > > > +
> > > > +If the device works in mode other than UACCE_DEV_NOIOMMU, *uacce* will set its
> > > > +IOMMU to IOMMU_DOMAIN_UNMANAGED. So the driver must not use any kernel
> > > > +DMA API but the following ones from *uacce* instead: ::
> > > > +
> > > > + uacce_dma_map(q, va, size, prot);
> > > > + uacce_dma_unmap(q, va, size, prot);
> > > > +
> > > > +*uacce_dma_map/unmap* is valid only for UACCE_DEV_SVA device. It creates a
> > > > +particular PASID and page table for the kernel in the IOMMU (Not yet
> > > > +implemented in the RFC)
> > > > +
> > > > +For the UACCE_DEV_SHARE_DOMAIN device, uacce_dma_map/unmap is not valid.
> > > > +*Uacce* call back start_queue only when the DUS and DKO region is mmapped. The
> > > > +accelerator driver must use those dma buffer, via uacce_queue->qfrs[], on
> > > > +start_queue call back. The size of the queue file region is defined by
> > > > +uacce->ops->qf_pg_start[].
> > > > +
> > > > +We have to do it this way because most of current IOMMU cannot support the
> > > > +kernel and user virtual address at the same time. So we have to let them both
> > > > +share the same user virtual address space.
> > > > +
> > > > +If the device have to support kernel and user at the same time, both kernel
> > > > +and the user should use these DMA API. This is not convenient. A better
> > > > +solution is to change the future DMA/IOMMU design to let them separate the
> > > > +address space between the user and kernel space. But it is not going to be in
> > > > +a short time.
> > > > +
> > > > +
> > > > +Multiple processes support
> > > > +==========================
> > > > +
> > > > +In the latest mainline kernel (4.19) when this document is written, the IOMMU
> > > > +subsystem do not support multiple process page tables yet.
> > > > +
> > > > +Most IOMMU hardware implementation support multi-process with the concept
> > > > +of PASID. But they may use different name, e.g. it is call sub-stream-id in
> > > > +SMMU of ARM. With PASID or similar design, multi page table can be added to
> > > > +the IOMMU and referred by its PASID.
> > > > +
> > > > +*JPB* has a patchset to enable this[1]_. We have tested it with our hardware
> > > > +(which is known as *D06*). It works well. *WarpDrive* rely on them to support
> > > > +UACCE_DEV_SVA. If it is not enabled, *WarpDrive* can still work. But it
> > > > +support only one process, the device will be set to UACCE_DEV_SHARE_DOMAIN
> > > > +even it is set to UACCE_DEV_SVA initially.
> > > > +
> > > > +Static Share Virtual Memory is mainly used by UACCE_DEV_SHARE_DOMAIN device.
> > > > +
> > > > +
> > > > +Legacy Mode Support
> > > > +===================
> > > > +For the hardware without IOMMU, WarpDrive can still work, the only problem is
> > > > +VA cannot be used in the device. The driver should adopt another strategy for
> > > > +the shared memory. It is only for testing, and not recommended.
> > > > +
> > > > +
> > > > +The Folk Scenario
> > > > +=================
> > > > +For a process with allocated queues and shared memory, what happen if it forks
> > > > +a child?
> > > > +
> > > > +The fd of the queue will be duplicated on folk, so the child can send request
> > > > +to the same queue as its parent. But the requests which is sent from processes
> > > > +except for the one who open the queue will be blocked.
> > > > +
> > > > +It is recommended to add O_CLOEXEC to the queue file.
> > > > +
> > > > +The queue mmap space has a VM_DONTCOPY in its VMA. So the child will lost all
> > > > +those VMAs.
> > > > +
> > > > +This is why *WarpDrive* does not adopt the mode used in *VFIO* and *InfiniBand*.
> > > > +Both solutions can set any user pointer for hardware sharing. But they cannot
> > > > +support fork when the dma is in process. Or the "Copy-On-Write" procedure will
> > > > +make the parent process lost its physical pages.
> > > > +
> > > > +
> > > > +The Sample Code
> > > > +===============
> > > > +There is a sample user land implementation with a simple driver for Hisilicon
> > > > +Hi1620 ZIP Accelerator.
> > > > +
> > > > +To test, do the following in samples/warpdrive (for the case of PC host): ::
> > > > + ./autogen.sh
> > > > + ./conf.sh # or simply ./configure if you build on target system
> > > > + make
> > > > +
> > > > +Then you can get test_hisi_zip in the test subdirectory. Copy it to the target
> > > > +system and make sure the hisi_zip driver is enabled (the major and minor of
> > > > +the uacce chrdev can be gotten from the dmesg or sysfs), and run: ::
> > > > + mknod /dev/ua1 c <major> <minior>
> > > > + test/test_hisi_zip -z < data > data.zip
> > > > + test/test_hisi_zip -g < data > data.gzip
> > > > +
> > > > +
> > > > +References
> > > > +==========
> > > > +.. [1] https://patchwork.kernel.org/patch/10394851/
> > > > +
> > > > +.. vim: tw=78
[...]
> > > > --
> > > > 2.17.1
> > > >