Re: Interacting with coherent memory on external devices

From: Jerome Glisse
Date: Fri Apr 24 2015 - 13:20:12 EST


On Fri, Apr 24, 2015 at 11:58:39AM -0500, Christoph Lameter wrote:
> On Fri, 24 Apr 2015, Jerome Glisse wrote:
>
> > > What exactly is the more advanced version's benefit? What are the features
> > > that the other platforms do not provide?
> >
> > Transparent access to device memory from the CPU: you can map any of the GPU
> > memory inside the CPU and have full cache coherency, including proper
> > atomic memory operations. CAPI is not some mumbo jumbo marketing name; there
> > is real hardware behind it.
>
> Got the hardware here but I am getting pretty sobered given what I heard
> here. The IBM mumbo jumbo marketing comes down to "not much" now.
>
> > On x86 you have to take into account the PCI BAR size; you also have to take
> > into account that PCIe transactions are really bad when it comes to sharing
> > memory with the CPU. CAPI really improves things here.
>
> Ok that would be interesting for the general device driver case. Can you
> show a real performance benefit here of CAPI transactions vs. PCI-E
> transactions?

I am sure IBM will show benchmarks here when they have everything in place. I
am not working on CAPI personally, I just went through some of the specification
for it.

> > So on x86, even if you could map all the GPU memory, it would still be a bad
> > solution, and things like atomic memory operations might not even work properly.
>
> That is solvable and doable in many other ways if needed. Actually I'd
> prefer a Xeon Phi in that case because then we also have the same
> instruction set. Having locks work right with different instruction sets
> and different coherency schemes. Ewww...
>

Well then go the Xeon Phi way, and let people who want to provide a different,
simpler (from the programmer's point of view) solution work on it.

>
> > > Then you have the problem of fast memory access and you are proposing to
> > > complicate that access path on the GPU.
> >
> > No, I am proposing a solution where people doing this kind of workload
> > can leverage the GPU. Yes, it will not be as fast as people hand tuning
> > and rewriting their application for the GPU, but it will still be faster
> > by a significant factor than only using the CPU.
>
> Well the general purpose processors are also gaining more floating point
> capabilities which increases the pressure on accelerators to become more
> specialized.
>
> > Moreover I am saying that this can happen without even touching a single
> > line of code of many, many applications, because many of them rely on libraries
> > and those are the only ones that would need to know about the GPU.
>
> Yea. We have heard this numerous times in parallel computing and it never
> really worked right.

Because you had a split address space: a pointer value did not point to the same
thing on the GPU as on the CPU, so porting a library or application was hard and
troublesome. AMD is already working on porting general applications and libraries
to leverage the brave new world of shared address space (libreoffice, gimp, ...).
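
To make the porting cost concrete, here is a minimal sketch in C of what a split
address space forces on a pointer-based structure. Everything in it (struct node,
struct flat_node, flatten, sum_list, sum_flat) is a hypothetical name invented for
illustration, not code from any driver or library. With a shared address space the
device could walk the list in place, as sum_list does; with a split one, every
pointer has to be rewritten into something valid on the other side (an index here)
before the data is handed over.

```c
#include <assert.h>
#include <stddef.h>

/* Ordinary CPU-side linked list: links are raw pointers. */
struct node {
    int value;
    struct node *next;
};

/* Flattened form for a split address space: links become array indices,
 * because the CPU pointer values would be meaningless on the device. */
struct flat_node {
    int value;
    int next; /* index of the next element, -1 at the end */
};

/* Shared address space case: consume the list exactly as the CPU built it. */
long sum_list(const struct node *head)
{
    long sum = 0;
    for (const struct node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}

/* Split address space case, step 1: copy and translate every link.
 * Returns the number of elements written, or -1 if out[] is too small. */
int flatten(const struct node *head, struct flat_node *out, int cap)
{
    int n = 0;

    assert(cap >= 0);
    for (const struct node *p = head; p != NULL; p = p->next) {
        if (n == cap)
            return -1;
        out[n].value = p->value;
        out[n].next = p->next ? n + 1 : -1;
        n++;
    }
    return n;
}

/* Split address space case, step 2: the consumer follows indices. */
long sum_flat(const struct flat_node *buf, int n)
{
    long sum = 0;
    for (int i = (n > 0) ? 0 : -1; i != -1; i = buf[i].next)
        sum += buf[i].value;
    return sum;
}
```

The flatten step and the second data layout are the part that every library would
have to grow for each of its data structures; that is the work a shared address
space is meant to remove.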

Other people keep pressing for a shared address space; again, this is the
cornerstone of OpenCL 2.0.

I cannot predict whether it will work this time, whether all meaningful and useful
libraries will start leveraging the GPU. All I am trying to do is solve the split
address space problem, a problem that you seem to ignore completely because you
are happy with the way things are. Other people are not happy.


>
> > Finally I am saying that having a unified address space between the GPU and
> > CPU is an essential prerequisite for this to happen in a transparent fashion,
> > and thus the DAX solution is nonsense and does not provide transparent address
> > space sharing. The DAX solution is not even something new; this is how today's
> > stack works, no need for DAX: userspace just mmaps the device driver
> > file and that's how it accesses the GPU accessible memory (which in most
> > cases is just system memory mapped through the device file to the user
> > application).
>
> Right this is how things work and you could improve on that. Stay with the
> scheme. Why would that not work if you map things the same way in both
> environments if both accelerator and host processor can access each
> other's memory?

Again and again: shared address space. A pointer means the same thing
for the GPU as it does for the CPU, i.e. any random pointer points to
the same memory whether it is accessed by the GPU or the CPU, while also
keeping the properties of the backing memory. It can be shared memory from
another process, a file mmaped from disk, or simply anonymous memory, and
thus we have no control whatsoever over how such memory is allocated.
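
As a rough illustration of the "no control over how memory is allocated" point,
the sketch below (C, Linux/POSIX) hands the same routine three pointers with
different backing: private anonymous memory, shared anonymous memory, and a file
mapping. The names sum_bytes and sum_three_backings are hypothetical, and
sum_bytes merely stands in for whatever a GPU-aware library would do; no real
device is involved. The point is only that the callee sees a plain pointer and
cannot tell, or dictate, where the memory came from.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LEN 4096

/* Stand-in for a library entry point: it only ever sees a pointer. */
unsigned long sum_bytes(const unsigned char *p, size_t n)
{
    unsigned long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Fills sums[] with the result for each kind of backing memory.
 * Returns 0 on success, -1 if any mapping could not be set up. */
int sum_three_backings(unsigned long sums[3])
{
    unsigned char *map[3] = { MAP_FAILED, MAP_FAILED, MAP_FAILED };
    int ret = -1;
    FILE *f = tmpfile();

    if (f == NULL)
        return -1;
    if (ftruncate(fileno(f), LEN) != 0)
        goto out;

    /* 0: private anonymous memory (what malloc typically sits on) */
    map[0] = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* 1: shared anonymous memory (could be shared with another process) */
    map[1] = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    /* 2: memory backed by a file */
    map[2] = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                  MAP_SHARED, fileno(f), 0);

    if (map[0] == MAP_FAILED || map[1] == MAP_FAILED || map[2] == MAP_FAILED)
        goto out;

    for (int i = 0; i < 3; i++) {
        memset(map[i], 1, LEN);
        /* Same call, same pointer type, three different backings. */
        sums[i] = sum_bytes(map[i], LEN);
    }
    ret = 0;
out:
    for (int i = 0; i < 3; i++)
        if (map[i] != MAP_FAILED)
            munmap(map[i], LEN);
    fclose(f);
    return ret;
}
```

Any scheme that requires memory to come from a special allocator or a device
file mapping cannot cover all three cases; a shared address space has to accept
whatever pointer the application already holds.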

Then you add transparent migration (transparent in the sense that we can
handle a CPU page fault on migrated memory) and you will see that you need
to modify the kernel to make it aware of this and to provide common code
to deal with all of it.

Cheers,
Jérôme