Re: [RFC 2.6.11-rc2-mm2 0/7] mm: manual page migration -- overview

From: Robin Holt
Date: Mon Feb 14 2005 - 10:36:32 EST


On Sat, Feb 12, 2005 at 12:17:25PM +0100, Andi Kleen wrote:
> Ray Bryant <raybry@xxxxxxx> writes:
> > set of pages associated with a particular process need to be moved.
> > The kernel interface that we are proposing is the following:
> >
> > page_migrate(pid, va_start, va_end, count, old_nodes, new_nodes);
>
> [Only commenting on the interface, haven't read your patches at all]
>
> This is basically mbind() with MPOL_F_STRICT, except that it has a pid
> argument. I assume that's for the benefit of your batch scheduler.
>
> But it's not clear to me how and why the batch scheduler should know about
> virtual addresses of different processes anyways. Walking
> /proc/pid/maps? That's all inherently racy when the process is doing
> mmap in parallel. The only way I can think of to do this would be to
> check for changes in maps after a full move and loop, but then you risk
> livelock.

For our use, the batch scheduler will give an intermediary program a
list of processes and a series of from-to node pairs. That process would
then ensure all the processes are stopped, scan their VMAs to determine
what regions are mapped by more than one process, which are mapped
by additional processes not in the job, and make this system call for
each of the unique ranges in the job to migrate their pages from one
node to the next. I believe Ray is working on a library and a standalone
program to do this from a command line.

>
> And you cannot also just specify va_start=0, va_end=~0UL because that
> would make the node arrays grow infinitely.

Across the job, you could be moving some memory regions multiple times.

>
> Also is there a good use case why the batch scheduler should only
> move individual areas in a process around, not the full process?

Overlapping regions.

>
> I think the only sane way for an external process to move another
> around is to do it for the whole process. For that you wouldn't need
> most of the arguments, but just a simple move_process_vm call,
> or perhaps just a file in /proc where the new node can be written to.

But when you take into consideration multiple processes in a job that
all started from one set of mappings and has since tweaked their
mappings to suite their particular needs, there doesn't appear to
be any way to do it without some form of leg work as described
above.

>
> There may be an argument to do this for individual
> tmpfs/hugetlbfs/sysv shm segments too, but mbind() already supports
> that (just map them from a different process and change the policy there)
>
> For process use you could just do it in mbind() or perhaps
> part of the process policy (move page around when touched by process).

This functionality will be used by the batch scheduler to not only
move the processes memory to a different set of nodes, but also reduce
the memory usage on the old set of nodes. For that reason, you can not
rely on touch as the process that _NEEDS_ the memory moved does not
have control of program flow to ensure that after the SIGCONT is sent
the process will touch all of its address space.

Thanks,
Robin Holt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/