Re: Enabling peer to peer device transactions for PCIe devices

From: Sagalovitch, Serguei
Date: Wed Nov 23 2016 - 19:43:14 EST


On Wed, Nov 23, 2016 at 02:11:29PM -0700, Logan Gunthorpe wrote:

> Perhaps I am not following what Serguei is asking for, but I
> understood the desire was for a complex GPU allocator that could
> migrate pages between GPU and CPU memory under control of the GPU
> driver, among other things. The desire is for DMA to continue to work
> even after these migrations happen.

The main issue is how to handle use cases where p2p is
requested/initiated via CPU pointers, and those pointers may
point to a non-system memory location, e.g. VRAM.

Solving this would give users a consistent working model in which they
deal only with pointers (HSA, CUDA, OpenCL 2.0 SVM). It would also be a
performance optimization, avoiding double-buffering and extra
special-case code when dealing with PCIe device memory.

Examples are:

- RDMA network operations: RDMA MRs where the registered memory
could be e.g. VRAM. Currently this is solved using the so-called
PeerDirect interface, which is out-of-tree and provided as part of OFED.
- File operations (fread/fwrite) when the user wants to transfer file
data directly to/from e.g. VRAM.
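
To make the file-operations example concrete, here is a minimal
user-space sketch of the two models. It is only an illustration: the
function names are hypothetical, and "dev_ptr" stands in for a CPU
mapping of device memory (in this sketch it can be any ordinary buffer,
so the code runs without a GPU).

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Today's model: read into system RAM, then copy into the device
 * mapping. dev_ptr is a stand-in for a CPU pointer to VRAM. */
static size_t read_double_buffered(FILE *f, void *dev_ptr, size_t len)
{
    void *staging;
    size_t got;

    assert(f != NULL && dev_ptr != NULL);
    staging = malloc(len);
    if (!staging)
        return 0;
    got = fread(staging, 1, len, f);
    memcpy(dev_ptr, staging, got);  /* the extra copy we want to avoid */
    free(staging);
    return got;
}

/* Desired model: the user hands the same CPU pointer straight to the
 * file API and needs no special code for device memory. */
static size_t read_direct(FILE *f, void *dev_ptr, size_t len)
{
    assert(f != NULL && dev_ptr != NULL);
    return fread(dev_ptr, 1, len, f);
}
```

Both functions leave the same bytes at dev_ptr; the second one simply
has no staging buffer in the caller's code. Whether the kernel can then
DMA directly into the target is exactly the question discussed here.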


Challenges are:
- Because the graphics sub-system must support overcommit (at least,
each application/process should independently see all resources),
ideally such memory should be movable without changing the CPU pointer
value, and it should also be possible to "page it out", with "page
fault" support at least on access from the CPU.
- We must co-exist with existing DRM infrastructure, as well as
support sharing VRAM memory between different processes
- We should be able to deal with large allocations: tens or hundreds of
MBs, or maybe GBs.
- We may have PCIe devices where p2p does not work.
- Potentially any GPU memory should be supported including
memory carved out from system RAM (e.g. allocated via
get_free_pages()).


Note:
- In the case of RDMA MRs, the life-span of the "pinning"
("get_user_pages"/"put_page") may be defined/controlled by the
application, not the kernel, so it may need to be treated
differently, as a special case.


The original proposal was to create "struct pages" for VRAM memory
to allow "get_user_pages" to work transparently, similar to
how it is/was done for the "DAX Device" case. Unfortunately,
based on my understanding, the "DAX Device" implementation
deals only with permanently "locked" memory (fixed location),
unrelated to the "get_user_pages"/"put_page" scope,
which doesn't satisfy the requirements for "eviction" / "moving" of
memory while keeping the CPU address intact.

> The desire is for DMA to continue to work
> even after these migrations happen
What is needed is at least some kind of mm notifier callback to inform
about a change in location (pre- and post-), similar to how it is done
for system pages. My understanding is that it will not solve the RDMA MR
issue, where the "lock" could last for the whole application lifetime,
but (a) it will not make the RDMA MR case worse and (b) it should be
enough for all other cases where "get_user_pages"/"put_page" is
controlled by the kernel.
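
As a rough illustration of the pre-/post-move callback idea, here is a
small user-space model. This is a sketch only, not kernel code: every
name in it (struct region, struct move_notifier, region_move() and so
on) is hypothetical, and the pin count is just a stand-in for the
"get_user_pages"/"put_page" scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum loc { LOC_VRAM, LOC_SYSRAM };

struct region;

/* Hypothetical listener interface: called before and after a move. */
struct move_notifier {
    void (*pre_move)(struct region *r, void *ctx);
    void (*post_move)(struct region *r, void *ctx);
    void *ctx;
};

/* A range whose CPU address stays fixed while its backing moves. */
struct region {
    enum loc where;
    int pin_count;              /* models get_user_pages/put_page scope */
    struct move_notifier *nb;
};

static void region_pin(struct region *r) { r->pin_count++; }

static void region_unpin(struct region *r)
{
    assert(r->pin_count > 0);
    r->pin_count--;
}

/* Move the backing store; refuse while pinned, notify around the move. */
static bool region_move(struct region *r, enum loc dst)
{
    if (r->pin_count > 0)
        return false;           /* a long-lived RDMA MR pin blocks here */
    if (r->nb && r->nb->pre_move)
        r->nb->pre_move(r, r->nb->ctx);
    r->where = dst;
    if (r->nb && r->nb->post_move)
        r->nb->post_move(r, r->nb->ctx);
    return true;
}

/* Example listener: a peer device that stops DMA, then remaps. */
struct peer_state { bool dma_active; int remaps; };

static void peer_pre(struct region *r, void *ctx)
{
    (void)r;
    ((struct peer_state *)ctx)->dma_active = false;
}

static void peer_post(struct region *r, void *ctx)
{
    struct peer_state *ps = ctx;

    (void)r;
    ps->remaps++;               /* re-resolve the new location */
    ps->dma_active = true;
}
```

In this model a short, kernel-controlled pin only delays the move,
while a pin held for the whole application lifetime makes region_move()
fail indefinitely. That is the RDMA MR case, which the notifier alone
does not solve.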