Re: HMM (heterogeneous memory management) v6

From: Jerome Glisse
Date: Wed Nov 12 2014 - 23:28:59 EST


On Wed, Nov 12, 2014 at 05:08:47PM -0600, Christoph Lameter wrote:
> On Wed, 12 Nov 2014, Jerome Glisse wrote:
>
> > > Could we define a new NUMA node that maps memory from the GPU and
> > > then simply use the existing NUMA features to move a process over there.
> >
> > So GPU processes will never run on the CPU, nor will they have a kernel task
> > struct associated with them. From the core kernel's point of view they do not
> > exist. I hope that at some point down the line the hw will allow for better
> > integration with the kernel core, but it's not there yet.
>
> Right. So all of this is not relevant because the GPU manages it. You only
> need access from the regular processors in Linux, which have and use page
> tables.
>
> > So the NUMA idea was considered early on but was discarded as it's not really
> > appropriate. You can have several CPU threads working with several GPU threads
> > at the same time, and they can either access disjoint memory or some shared
> > memory. The usual case will be a few kbytes of shared memory for synchronization
> > between CPU and GPU threads.
>
> It is possible to have several threads accessing the memory in Linux. The
> GPU threads run on the GPU and therefore are not a Linux issue. Where do
> you see the problem?

When they both use system memory there is no issue, but to leverage the GPU to
its full potential you need to migrate memory from system memory to GPU memory
for the duration of the GPU computation (which might be several minutes/hours
or more). At the same time you do not want CPU access to be forbidden, so if a
CPU access does happen you want to catch the CPU fault, schedule a migration of
the GPU memory back to system memory, and resume the CPU thread that faulted.

So from the CPU point of view this GPU memory is like swap: the memory is swapped
out to GPU memory, and this is exactly how I implemented it, using a special swap
type. Refer to v1 of my patchset where I showcase the implementation of most
of the features.
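
To make the swap analogy concrete, here is a minimal sketch (not the actual
patchset code) of how a CPU fault on such an entry could be handled; SWP_HMM
and hmm_migrate_back() are placeholder names standing in for whatever the
patchset really defines:

/*
 * Hedged sketch, not the real HMM code.  Called from the CPU fault path
 * when the faulting pte is a swap entry of the special device type: the
 * page currently lives in GPU memory, so schedule a migration back to
 * system memory and let the faulting thread retry afterwards.
 */
static int do_hmm_swap_fault(struct vm_area_struct *vma, unsigned long addr,
			     pte_t orig_pte)
{
	swp_entry_t entry = pte_to_swp_entry(orig_pte);

	if (swp_type(entry) != SWP_HMM)		/* sanity check only */
		return VM_FAULT_SIGBUS;

	/* Copy the page back from device memory and fix up the pte. */
	return hmm_migrate_back(vma->vm_mm, vma, addr, entry);
}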

>
> > But when a GPU job is launched we want most of the memory it will use to be
> > migrated to device memory. The issue is that the device memory is not accessible
> > from the CPU (PCIE BARs are too small). So there is no way to keep the memory
> > mapped for the CPU. We do need to mark the memory as inaccessible to the CPU
> > and then migrate it to GPU memory.
>
> Ok so this is a transfer issue? Isn't this like block I/O? Write to a device?
>

It can be as slow as block I/O, but it's unlike a block device; it's closer to
NUMA in theory, because it's just about having memory close to the compute unit
(ie GPU memory in this case), but nothing else besides that matches NUMA.

>
> > Now when there is a CPU page fault on some migrated memory we need to migrate
> > the memory back to system memory. Hence why I need to tie HMM into some core MM
> > code, so that on this kind of fault the core kernel knows it needs to call into
> > HMM, which will perform housekeeping and start migration back to system
> > memory.
>
>
> Sounds like a read operation, and like a major fault if you used
> device semantics. You write the pages to the device and then evict them
> from memory (madvise can do that for you). An access then causes a page
> fault which leads to a read operation from the device.

Yes, it's a major fault case, but we do not want to require any special
syscall; think of an existing application that links against a library. Now you
port the library to use the GPU, but the application is ignorant of this, and
thus any CPU access it does will be through the usual mmaped range that did not
go through any special syscall.
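
To illustrate, here is a hypothetical userspace view; the application is
unmodified and GPU-unaware, and fill_input() and compute_on_gpu() are made-up
library routines, not part of the patchset:

/* Hypothetical illustration: only the library knows about the GPU. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t n = 1 << 20;
	float *buf = malloc(n * sizeof(*buf));	/* ordinary anonymous memory */

	fill_input(buf, n);	/* CPU writes, pages sit in system RAM */
	compute_on_gpu(buf, n);	/* library migrates the pages to GPU memory */

	/* This CPU read faults on the migrated range; HMM migrates the
	 * page back to system memory and the read completes normally. */
	printf("%f\n", buf[0]);

	free(buf);
	return 0;
}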

>
> > So technically there is no task migration, only memory migration.
> >
> >
> > Is there something I am missing inside NUMA, or some NUMA work in progress that
> > changes NUMA sufficiently that it might somehow address the use case I am
> > describing above?
>
> I think you need to be looking at treating GPU memory as a block device;
> then you have the semantics you need.

This was explored too, but a block device does not match what we want. A block
device is nice for file-backed memory: we could have a special file backed by
GPU memory, and processes would open that special file and write to it. But this
is not how we want to use this; we really do want to mirror the process address
space, ie any kind of existing CPU mapping can be used by the GPU (except mmaped
IO), and we want to be able to migrate any of those existing CPU mappings to GPU
memory while still being able to service CPU page faults on ranges migrated to
GPU memory.
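
From the driver side the intent looks roughly like the sketch below;
hmm_mirror_register() and hmm_migrate_to_device() are illustrative names in
the spirit of the patchset, not its exact v6 API, and gpu_mirror_ops, job_start
and job_length are placeholders:

	/* Illustrative only, not the actual v6 interface. */
	struct hmm_mirror *mirror;

	/* Mirror the whole process address space into the GPU page tables;
	 * any existing CPU mapping (except mmaped IO) becomes usable by
	 * the GPU without any special syscall from the application. */
	mirror = hmm_mirror_register(current->mm, &gpu_mirror_ops);

	/* Before a long GPU job, migrate the job's working set to device
	 * memory; the CPU PTEs become special swap entries until a CPU
	 * fault pulls the pages back to system memory. */
	hmm_migrate_to_device(mirror, job_start, job_length);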

So unless there is something I am completely oblivious to in the block device
model in the Linux kernel, I fail to see how it could apply to what we want to
achieve.

Cheers,
Jérôme