Re: Interacting with coherent memory on external devices

From: Oded Gabbay
Date: Fri Apr 24 2015 - 14:41:58 EST

On 04/23/2015 07:22 PM, Jerome Glisse wrote:
On Thu, Apr 23, 2015 at 09:20:55AM -0500, Christoph Lameter wrote:
On Thu, 23 Apr 2015, Benjamin Herrenschmidt wrote:

There are hooks in glibc where you can replace the memory
management of the apps if you want that.

We don't control the app. Let's say we are doing a plugin for libfoo
which accelerates "foo" using GPUs.

There are numerous examples of malloc implementations that can be used
for apps without modifying the app.

What about shared memory passed between processes? Or an mmapped file?
Or a library that is loaded through dlopen and thus has no way to
control any allocation that happened before it became active?
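
To make the point above concrete, here is a minimal sketch in C. It
assumes a Linux/glibc system; foo_count_byte() is a hypothetical
stand-in for a libfoo entry point and is not taken from any real
library. The buffer comes straight from mmap(), so a replacement malloc
would never have seen it, yet the library still receives a plain
pointer to it.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Map `len` bytes of anonymous shared memory directly from the kernel.
 * This path does not go through malloc(), so an interposed allocator
 * is never told about the region. Returns NULL on failure. */
char *map_shared_buffer(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (char *)p;
}

/* Hypothetical library entry point (illustration only): it is handed a
 * pointer and a length and cannot tell which allocator produced them. */
size_t foo_count_byte(const char *buf, size_t len, char needle)
{
    size_t hits = 0;

    assert(buf != NULL);
    for (size_t i = 0; i < len; i++)
        if (buf[i] == needle)
            hits++;
    return hits;
}
```

A caller would do something like buf = map_shared_buffer(4096), fill it,
and pass it to foo_count_byte(); an accelerator plugin behind that call
would have to cope with memory it never allocated.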

Now some other app we have no control on uses libfoo. So pointers
already allocated/mapped, possibly a long time ago, will hit libfoo (or
the plugin) and we need GPUs to churn on the data.

If the GPU needs to suspend one of its computation threads to wait on
a mapping to be established on demand or so, then it looks like the
performance of the parallel threads on a GPU will be significantly
compromised. You would want to do the transfer explicitly in some fashion
that meshes with the concurrent calculation in the GPU. You do not want
stalls while GPU number crunching is ongoing.

You do not understand how GPUs work. GPUs have a pool of threads, and
they always try to keep the pool as big as possible so that when a group
of threads is waiting for some memory access, there are other threads
ready to perform some operation. GPUs are about hiding memory latency;
that's what they are good at. But they only achieve that when they have
more threads in flight than compute units. The whole thread scheduling
is done by hardware and is barely controlled by the device driver.

So no, having the GPU wait for a page fault is not as dramatic as you
think. If you use GPUs as they are intended to be used, you might never
even notice the page fault and still reach close to the theoretical
throughput of the GPU.
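
The latency-hiding argument can be illustrated with a toy model in C.
This is only a back-of-the-envelope sketch under stated assumptions, not
a description of any real GPU scheduler: each thread is assumed to
alternate a fixed number of busy cycles with a fixed number of stalled
cycles (a memory access or a page fault), and switching between threads
is assumed to be free. The function name and parameters are made up for
this example.

```c
#include <assert.h>

/* Toy model: each thread alternates `compute` busy cycles with `stall`
 * cycles spent waiting. With `threads` such threads sharing one compute
 * unit, the unit can be kept busy for at most threads * compute cycles
 * out of every (compute + stall) cycles, capped at full utilization. */
double unit_utilization(unsigned threads, unsigned compute, unsigned stall)
{
    assert(compute > 0);

    double demand = (double)threads * compute / ((double)compute + stall);
    return demand < 1.0 ? demand : 1.0;
}
```

With 10 busy cycles per 90 stalled cycles, one thread keeps the unit
about 10% busy, while ten or more threads saturate it in this model. A
longer stall, such as a page fault, raises the number of threads needed
but does not change the shape of the argument.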

The point I'm making is you are arguing against a usage model which has
been repeatedly asked for by a large number of customers (after all,
that's also why HMM exists).

I am still not clear what the use case for this would be. Who is asking
for this?

Everyone but you? OpenCL 2.0 specifically requests it and has several
levels of support for a transparent address space. The lowest one is the
one implemented today, in which the application needs to use a special
memory allocator.

The most advanced one implies integration with the kernel, in which any
memory (mmapped file, shared memory or anonymous memory) can be used by
the GPU and does not need to come from a special allocator.
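
The difference between these two levels can be sketched in plain C. This
is an illustration only: the "device" is simulated on the CPU, and every
function name here (dev_double, run_staged, run_shared) is hypothetical
rather than part of OpenCL, HMM or any driver API. In the staged
variant, malloc() merely stands in for a special device allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical "device kernel", simulated on the CPU: doubles each item. */
void dev_double(int *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] *= 2;
}

/* Lowest level: the device only works on buffers from a special
 * allocator, so the data is staged in and copied back out. */
int run_staged(int *host, size_t n)
{
    int *dev = malloc(n * sizeof *dev);  /* stand-in for a device allocator */

    if (dev == NULL)
        return -1;
    memcpy(dev, host, n * sizeof *dev);  /* copy in */
    dev_double(dev, n);
    memcpy(host, dev, n * sizeof *dev);  /* copy out */
    free(dev);
    return 0;
}

/* Most advanced level: any valid pointer in the process is handed to
 * the device as-is, with no special allocation and no staging copies. */
int run_shared(int *host, size_t n)
{
    assert(host != NULL);
    dev_double(host, n);
    return 0;
}
```

Both variants produce the same result; the second simply drops the
allocator and the copies, which is the part application code no longer
has to manage.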

Everyone in the industry is moving toward the most advanced one. That
is the raison d'être of HMM: to provide this functionality on hardware
platforms that do not have things such as CAPI, that is, x86/arm.

So the use case is every application using OpenCL or CUDA. So pretty
much everyone doing GPGPU wants this. I dunno how you can't see that.
A shared address space is so much easier. Believe it or not, most coders
do not have deep knowledge of how things work, and if you can remove
the complexity of different memory allocators and different address
spaces from them, they will be happy.

I second what Jerome said, and add that one of the key features of HSA is the ptr-is-a-ptr scheme, where the applications do *not* need to handle different address spaces. Instead, all the memory is seen as a unified address space.

See slide 6 on the following presentation:

