Re: [RFC PATCH 3/5] mm/vma: add support for peer to peer to device vma

From: Jerome Glisse
Date: Tue Jan 29 2019 - 18:48:01 EST


On Tue, Jan 29, 2019 at 03:58:45PM -0700, Logan Gunthorpe wrote:
>
>
> On 2019-01-29 2:50 p.m., Jerome Glisse wrote:
> > No this is the non HMM case i am talking about here. Fully ignore HMM
> > in this frame. A GPU driver that do not support or use HMM in anyway
> > has all the properties and requirement i do list above. So all the points
> > i was making are without HMM in the picture whatsoever. I should have
> > posted this a separate patches to avoid this confusion.
> >
> > Regarding your HMM question. You can not map HMM pages, all code path
> > that would try that would trigger a migration back to regular memory
> > and will use the regular memory for CPU access.
> >
>
> I thought this was the whole point of HMM... And eventually it would
> support being able to map the pages through the BAR in cooperation with
> the driver. If not, what's that whole layer for? Why not just have HMM
> handle this situation?

The whole point is to allow device memory to back a range of virtual
addresses of a process when it makes sense to use device memory for
that range. There are multiple cases where it does make sense:
[1] - Only the device is accessing the range and there is no CPU access.
      For instance the program is running a big function on the GPU
      and there is no concurrent CPU access. This is very common in
      all the existing GPGPU code, in fact AFAICT it is the most
      common pattern. Here you can use HMM private or public memory.
[2] - Both device and CPU access a common range of virtual addresses
      concurrently. In that case, if you are on a platform with a
      cache coherent interconnect like OpenCAPI or CCIX, you can use
      HMM public device memory and have both access the same memory.
      You can not use HMM private memory.

So far on x86 we only have PCIE, and thus we only have private HMM
device memory, which is not accessible by the CPU in any way.

That does not make the memory useless, far from it. Having only the
device work on the dataset while the CPU is either waiting or accessing
something else is very common.


HMM is a toolbox, so here are some of the tools:
HMM mirror - helper to mirror a process address space onto a device,
ie this is SVM (Shared Virtual Memory)/SVA (Shared Virtual Address)
in software.

HMM private memory - allows registering device memory with the linux
kernel. The memory is not CPU accessible. The memory is fully managed
by the device driver. What and when to migrate is under the control
of the device driver.

HMM public memory - allows registering device memory with the linux
kernel. The memory must be CPU accessible and cache coherent and
abide by the platform memory model. The memory is fully managed by
the device driver because otherwise it would disrupt the device
driver operation (for instance a GPU can also be used for graphics).

Migration - helper to perform migration to and from device memory.
It does not make any decision on its own, it just performs all the
steps in the right order and calls back into the driver to get the
migration going.

It is up to the device driver to implement heuristics and provide a
userspace API to control memory migration to and from device memory.
For device private memory, on CPU page fault the kernel will force a
migration back to system memory so that the CPU can access the memory.
What matters here is that the memory model of the platform is intact
and thus you can safely use CPU atomic operations or rely on your
platform memory model for your program. Note that long term i would
like to define a common API to expose to userspace to manage memory
binding to specific device memory, so that we can mix and match
multiple device memories in a single process and define policy too.

Also CPU atomic instructions to a PCIE BAR give _undefined_ results and
in fact on some AMD/Intel platforms they lead to weirdness/crash/freeze.
So obviously we can not map a PCIE BAR to the CPU without breaking the
memory model. Moreover on PCIE you might not be able to resize the BAR
to expose all the device memory. A GPU can have several gigabytes of
memory and not all of them support PCIE BAR resize, and sometimes PCIE
BAR resize does not work either because of a bios/firmware issue or
simply because you are running out of IO space.

So on x86 we are stuck with HMM private memory. I am hoping that some
day in the future we will have CCIX or something similar. But for now
we have to work with what we have.

> And what struct pages are actually going to be backing these VMAs if
> it's not using HMM?

When you have some range of virtual addresses migrated to HMM private
memory then the CPU ptes are special swap entries and they behave just
as if the memory was swapped to disk. So CPU access to those will
fault and trigger a migration back to main memory.
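
To make the swap entry behavior concrete, here is a toy userspace model
in Python. It is only a sketch and not kernel code: ToyAddressSpace,
Where and every method name are made up for illustration. It shows one
thing only, a CPU access to a page sitting in device private memory
first forces the page back to main memory and then returns the data.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Where(Enum):
    SYSTEM = auto()          # regular main memory, CPU pte is valid
    DEVICE_PRIVATE = auto()  # device memory, CPU pte is a special swap entry


@dataclass
class ToyAddressSpace:
    """Toy stand-in for a process address space (hypothetical, not a kernel API)."""
    pages: dict = field(default_factory=dict)  # virtual page -> [Where, value]
    migrations_back: int = 0

    def map_system(self, vpage, value):
        self.pages[vpage] = [Where.SYSTEM, value]

    def migrate_to_device(self, vpage):
        # Driver decision: from now on the CPU pte is a special swap entry.
        self.pages[vpage][0] = Where.DEVICE_PRIVATE

    def cpu_read(self, vpage):
        entry = self.pages[vpage]
        if entry[0] is Where.DEVICE_PRIVATE:
            # CPU fault: force the migration back to main memory first.
            entry[0] = Where.SYSTEM
            self.migrations_back += 1
        return entry[1]
```

A second cpu_read() of the same page does not migrate again, since the
page is back in main memory until the driver decides to move it.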

We still want to allow peer to peer to exist when using HMM memory
for a range of virtual addresses (of a vma that is not an mmap of a
device file) because the peer device does not rely on atomics or on the
platform memory model. In those cases we assume that the importer is
aware of the limitation and is asking for access in good faith, and
thus we want to let the exporting device either allow the peer mapping
(because it has enough BAR address space to map) or fall back to main
memory.
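
The exporter side decision can be sketched as a tiny Python function.
This is an assumption laden illustration, not the callback from the
patches: export_for_peer, its parameters and the returned strings are
all hypothetical names. The only policy modeled is "map through the BAR
if the driver allows it and there is enough free BAR space, otherwise
fall back to main memory".

```python
def export_for_peer(object_size, bar_free, allow_peer=True):
    """Return (backing, bar_left) for a peer mapping request.

    Hypothetical exporter policy: use the BAR when the driver allows
    peer access and enough BAR space is free, else use main memory.
    """
    if allow_peer and object_size <= bar_free:
        return "device-bar", bar_free - object_size
    # Not allowed, or not enough BAR space: importer gets main memory.
    return "main-memory", bar_free
```

Either way the importer gets something it can use, it just does not get
to choose which memory backs the mapping.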


> > Again HMM has nothing to do here, ignore HMM it does not play any role
> > and it is not involve in anyway here. GPU want to control what object
> > they allow other device to access and object they do not allow. GPU driver
> > _constantly_ invalidate the CPU page table and in fact the CPU page table
> > do not have any valid pte for a vma that is an mmap of GPU device file
> > for most of the vma lifetime. Changing that would highly disrupt and
> > break GPU drivers. They need to control that, they need to control what
> > to do if another device tries to peer map some of their memory. Hence
> > why they need to implement the callback and decide on wether or not they
> > allow the peer mapping or use device memory for it (they can decide to
> > fallback to main memory).
>
> But mapping is an operation of the memory/struct pages behind the VMA;
> not of the VMA itself and I think that's evident by the code in that the
> only way the VMA layer is involved is the fact that you're abusing
> vm_ops by adding new ops there and calling it by other layers.

For GPU drivers the vma ptes are populated on CPU page fault and they
get cleared quickly after. A very usual pattern is:
- CPU writes something to the object through the object mapping, ie
  through a vma. This triggers a page fault which calls the fault()
  callback from vm_operations_struct. This populates the page table
  for the vma.
- Userspace launches commands on the GPU. The first thing the kernel
  does is clear all CPU page table entries for objects listed in the
  commands, ie we do not expect any further CPU access nor do we
  want it.
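
The two steps above can be mimicked with a toy Python model. Again this
is only a sketch with made up names (ToyGpuObject, cpu_write,
submit_commands), it is not how any real driver is written. It just
shows that ptes only exist between a CPU fault and the next command
submission.

```python
class ToyGpuObject:
    """Toy model of an mmap'ed GPU object (illustrative names only)."""

    def __init__(self, npages):
        self.npages = npages
        self.cpu_ptes = set()  # page indexes that have a valid CPU pte
        self.faults = 0

    def cpu_write(self, page):
        if not 0 <= page < self.npages:
            raise IndexError(page)
        if page not in self.cpu_ptes:
            # Stands in for the fault() callback populating the page table.
            self.faults += 1
            self.cpu_ptes.add(page)

    def submit_commands(self):
        # First thing on command submission: drop every CPU pte of the object.
        self.cpu_ptes.clear()
```

After submit_commands() the next CPU write faults again, which is the
point where a real driver gets to decide what to do about it.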

GPU drivers have always been geared toward minimizing CPU access to GPU
memory. For objects that need to be accessed by both concurrently we use
main memory and not device memory.

So in fact you will almost never have valid ptes for an mmap of a GPU
object (done through the GPU device file). However it does not mean that
we want to block peer to peer from happening. Today the use cases we
know for peer to peer are with GPUDirect (NVidia) or ROCmDMA (AMD),
roughly the same thing. The most common use cases i am aware of are:
- RDMA is streaming input directly into GPU memory, avoiding the
  need to have a bounce buffer in main memory (this saves both main
  memory and PCIE bandwidth by avoiding RDMA->main then main->GPU).
- RDMA is streaming out results (same idea as streaming in but in
  the other direction :))
- RDMA is used to monitor computation progress on the GPU and it
  tries to do so with minimal disruption to the GPU. So RDMA would
  like to be able to peek into GPU memory to fetch some values
  and transmit them over the network.

I believe people would like to have more complex use cases, like for
instance having the GPU be able to directly control some RDMA queue
to request data from some other host on the network, or control some
block device queue to read data from a block device directly. I believe
those can be implemented with the API put forward in those patches.

So for the above use cases it is fine to not have valid CPU ptes and
only have a peer to peer mapping. The CPU is not expected to be involved
and we should not make it a requirement. Hence we should not expect
to have valid ptes.


Another common use case is that the GPU driver might leave ptes that
point to main memory while the GPU is using device memory for the
object corresponding to the vma those ptes are in. The expectation is
that CPU accesses are synchronized with device accesses through the
API used by the application. Note that here we are talking about the
non HMM, non SVM case, ie special objects that are allocated through
API specific functions that result in driver ioctl and mmap of a
device file.


Hope this helps to understand the big picture from the GPU driver
point of view :)

Cheers,
Jérôme