Re: NUMA API - wish list

From: Andi Kleen
Date: Mon May 03 2004 - 08:19:03 EST


Zoltan Menyhart <Zoltan.Menyhart_AT_bull.net@xxxxxxxxxx> writes:

> The work load manager / load balancer can negotiate other resource
> assignment at any time with the application.
> The work load manager / load balancer is free to move a collection of
> resources from some NUMA domains to others, provided the application's
> requirements are still met. (No hard binding.)

IMHO these are hard research topics that will need considerable
more work to be automated, if they will ever work automated at all.
The main problem is that you several conflicting goals: you
want to use all available CPU power, all available memory,
all available memory bandwidth and the best average memory latency.
They all conflict.

First: basically any more advanced automatic schemes will
require to go all the way to a full workload manager
that can move around memory later, because it is near impossible
to get even two of these goals right in advance.

I first tried to develop a NUMA scheduler "homenode scheduler" that
attempted to do a lot of this automatically. I then realized that it
is just too hard to do and it never worked very well. That is why I
changed gears and just started with a simple API to let the user tell
the kernel what he wants.

The advantage of this is that a lot of complexity is avoided;
e.g. the NUMA API avoids any need to move memory around.

Now if somebody comes up with a good design for a workload manager and
does all the experiments needed to validate it then it could be later
added. But defering NUMA optimization efforts until this considerable
task is solved (if it even can be solved) would be a big mistake IMHO.

> Billing is done accordingly :-)
>
> As you do not need to know anything about SCSI LUNs, sector IDs, phy-
> sical memory maps or the other applications when you compile your kernel,
> why should an application care for HW NUMA details ?

There is a big difference between these and NUMA.

LUNs, sectors, physical memory are all hidden for correctness. For
that virtualization is fine, because performance is secondary
after correctness.

But NUMA knowledge is purely for optimization. And for optimization
purposes you want to avoid virtualization layers, because they get
in the way of your optimization efforts.

When a human does NUMA optimization they usually want to work near the
bare hardware. And if your dream of a automatic workload manager ever
worked it would also work on the bare hardware.

-Andi

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/