Re: [PATCH V3 0/4] Define coherent device memory node

From: Jerome Glisse
Date: Wed Feb 22 2017 - 09:59:38 EST

On Wed, Feb 22, 2017 at 10:29:21AM +0100, Michal Hocko wrote:
> On Tue 21-02-17 18:39:17, Anshuman Khandual wrote:
> > On 02/17/2017 07:02 PM, Mel Gorman wrote:


> [...]
> > These are the reasons which prohibit the use of HMM for coherent
> > addressable device memory purpose.
> >
> [...]
> > (3) Application cannot directly allocate into device memory from user
> > space using existing memory related system calls like mmap() and mbind()
> > as the device memory hides away in ZONE_DEVICE.
> Why cannot the application simply use mmap on the device file?

This has been said before but we want to share the address space this do
imply that you can not rely on special allocator. For instance you can
have an application that use a library and the library use the GPU but
the application is un-aware and those any data provided by the application
to the library will come from generic malloc (mmap anonymous or from
regular file).

Currently what happens is that the library reallocate memory through
special allocator and copy thing. Not only does this waste memory (the
new memory is often regular memory too) but you also have to paid the
cost of copying GB of data.

Last bullet to this, is complex data structure (list, tree, ...) having
to go through special allocator means you have re-build the whole structure
with the duplicated memory.

Allowing to directly use memory allocated from malloc (mmap anonymous
private or from a regular file) avoid the copy operation and the complex
duplication of data structure. Moving the dataset to the GPU is then a
simple memory migration from kernel point of view.

This is share address space without special allocator is mandatory in new
or future standard such as OpenCL, Cuda, C++, OpenMP, ... some other OS
already have this and the industry want it. So the questions is do we
want to support any of this, do we care about GPGPU ?

I believe we want to support all this new standard but maybe i am the
only one.

In HMM case i have the extra painfull fact that the device memory is
not accessible by the CPU. For CDM on contrary, CPU can access in a
cache coherent way the device memory and all operation behave as regular
memory (thing like atomic operation for instance).

I hope this clearly explain why we can no longer rely on dedicated/
specialized memory allocator.