Re: [PATCH 5/8] sched, numa, mm: Add adaptive NUMA affinity support

From: Ingo Molnar
Date: Tue Nov 13 2012 - 03:19:32 EST



* Christoph Lameter <cl@xxxxxxxxx> wrote:

> > Using this, we can construct two per-task node-vectors,
> > 'S_i' and 'P_i' reflecting the amount of shared and
> > privately used pages of this task respectively. Pages for
> > which two consecutive 'hits' are of the same cpu are assumed
> > private and the others are shared.
>
> The classification is per task? [...]

Yes, exactly - access patterns are fundamentally and physically
per task, as a task can execute only on a single CPU at once. (I
say 'task' instead of 'thread' or 'process' because the new code
makes no distinction between threads and processes.)
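
In (very schematic, made-up) code, the classification step the
quoted changelog bit describes amounts to something like this -
not the actual patch code, just the idea, with all names invented
here:

  /*
   * Each sampled access ('hit') to a page gets accounted in the
   * per-task node vectors: if the previous hit on that page came
   * from the same CPU (i.e. almost certainly the same task), the
   * page is assumed private ('P_i'), otherwise shared ('S_i').
   */
  #define NR_NODES 8                      /* stand-in for MAX_NUMNODES */

  struct numa_stats {
          unsigned long priv[NR_NODES];   /* 'P_i': private pages */
          unsigned long shared[NR_NODES]; /* 'S_i': shared pages  */
  };

  static void account_hit(struct numa_stats *ns, int node,
                          int last_cpu, int this_cpu)
  {
          if (last_cpu == this_cpu)
                  ns->priv[node]++;       /* two consecutive hits: private */
          else
                  ns->shared[node]++;     /* accessor changed: shared      */
  }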

The new code maps out inter-task relationships, statistically.

So we are basically able to (statistically) approximate which
task relates to which other task in the system, based on their
memory access patterns alone - using a very compact metric,
matching scheduler rules and a (lazy) memory placement machinery
on the VM side.

Consider the following 10-task workload, where the scheduler
is able to figure out these relationships:

{ A, B, C, D } dominantly share memory X with each other
{ E, F, G, H } dominantly share memory Y with each other
{ I } uses memory privately
{ J } uses memory privately

and the scheduler rules then try to converge these groups of
tasks ideally.

[ The 'role' and grouping of tasks are not static but sampled
  and average-based - so if a worker thread changes its role,
  the scheduler will adapt its placement accordingly. ]

[ A 'private' task is basically a special case of memory
  sharing: a task that only shares memory with itself. Its
  placement and spreading are easy. ]
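
To make the 'relates to' part a bit more concrete, here is a
rough sketch of one possible rule - again made-up code, reusing
the numa_stats bits from the snippet above, not the actual
scheduler rules in the patch:

  /*
   * Two tasks are considered related if their shared node
   * vectors ('S_i') overlap substantially.  Illustrative only.
   */
  static int tasks_related(const struct numa_stats *a,
                           const struct numa_stats *b)
  {
          unsigned long overlap = 0, total = 0;
          int node;

          for (node = 0; node < NR_NODES; node++) {
                  unsigned long sa = a->shared[node], sb = b->shared[node];

                  overlap += sa < sb ? sa : sb;
                  total   += sa + sb;
          }

          /* 'related' if at least ~25% of the shared hits overlap: */
          return total && 4 * overlap >= total;
  }

With a rule like that, { A, B, C, D } and { E, F, G, H } fall out
as two groups, while I and J relate to nobody but themselves.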

> [...] But most tasks have memory areas that are private and
> other areas where shared accesses occur. Can that be per
> memory area? [...]

Do you mean per vma, and/or per mm?

How would that work? Consider a case similar to the above:

- 12x CPU-intense threads and a large piece of shared memory

- 2x 4 threads are using two large shared memory areas for
  their calculations (one area for each group of threads)

- the 4 remaining processes aggregate and sort the results from
the 8 threads, in their own dominantly 'private' working set.

How does a per-vma or per-mm metric describe that properly? The
whole workload might live within a single large vma inside a
JVM. Or it might be implemented using processes and anonymous
shared memory.

If you look at this from a 'per task access pattern and
inter-task working set relationship' perspective, then the
resolution and the optimization are natural: each group of 4
threads should be kept together (modulo capacity constraints),
while the remaining 4 'private memory' threads should be spread
out over the remaining capacity of the system.
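
Schematically (again made-up code, not the patch), the placement
preference that falls out of this is simply the node on which the
relevant fault vector concentrates:

  /*
   * Pick the node on which a fault vector concentrates: a
   * group's summed 'S_i' for the sharing threads, the task's
   * own 'P_i' for the private ones.  Illustrative only.
   */
  static int preferred_node(const unsigned long *faults)
  {
          int node, best = 0;

          for (node = 1; node < NR_NODES; node++)
                  if (faults[node] > faults[best])
                          best = node;

          return best;
  }

The sharing groups then converge on 'their' node, while the
private tasks can be spread over whatever capacity remains.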

What matters is how tasks relate to each other as they perform
processing, not which APIs the workload uses to create tasks and
memory areas.

The main constraint from a placement optimization complexity POV
is task->CPU placement: for NUMA workloads the main challenge -
and 80% of the code and much of the real meat of the feature -
is to categorize and place tasks properly.

There might be much discussion about PROT_NONE and memory
migration details, but that is because the VM code is 5 times
larger than the scheduler code - and accordingly there are 5
times more VM hackers than scheduler hackers ;-)

In reality the main complexity of this problem [the placement
optimization portion] is dominantly a CPU/task scheduler matter,
and IMO rather fundamentally so: it is not an implementation
choice but derives, in essence, from the Von Neumann model of
computing.

And that is why IMO the task-based access pattern metric
implementation is such a good fit in practice as well - and that
is why other approaches have struggled to get a handle on the
NUMA problem.

Thanks,

Ingo