[PATCH] NUMA scheduler 1/2

From: Erich Focht (efocht@ess.nec.de)
Date: Fri Oct 25 2002 - 12:37:53 EST


Here come the rediffed (for 2.5.44) patches for my version of the
NUMA scheduler extensions. I'm only sending the first two parts of
the complete set of 5 patches (which together make up the node affine
NUMA scheduler with dynamic homenode selection). These two patches
lead to a pooling NUMA scheduler with initial load balancing at exec().

The balancing strategy so far is (see the sketch below):
- try to balance within the task's own node first
- when balancing across nodes, avoid creating big differences in node loads
- when doing an exec(), move the task to the least loaded node
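
To make the decision order concrete, here is a condensed sketch of the
steal-side logic as plain C. NR_CPUS, NR_NODES, the load[] array and the
imbalance threshold are simplified stand-ins; the actual patch differs
in the details:

#define NR_CPUS  16
#define NR_NODES 4

static int load[NR_CPUS];               /* per-CPU runqueue lengths */

static int cpu_to_node(int cpu)
{
        return cpu / (NR_CPUS / NR_NODES);
}

/* Most loaded CPU on 'node' with at least 2 tasks more than 'self';
 * with only 1 more, stealing would merely swap the imbalance. */
static int busiest_in_node(int node, int self)
{
        int cpu, best = -1, max = load[self] + 1;

        for (cpu = 0; cpu < NR_CPUS; cpu++)
                if (cpu_to_node(cpu) == node && cpu != self &&
                    load[cpu] > max) {
                        max = load[cpu];
                        best = cpu;
                }
        return best;
}

static int find_busiest_cpu(int self)
{
        int node_load[NR_NODES] = { 0 };
        int this_node = cpu_to_node(self);
        int cpu, node, max_node, best;

        /* rule 1: prefer stealing inside the own node */
        best = busiest_in_node(this_node, self);
        if (best >= 0)
                return best;

        /* rule 2: go off-node only for a clear node-level imbalance */
        for (cpu = 0; cpu < NR_CPUS; cpu++)
                node_load[cpu_to_node(cpu)] += load[cpu];
        max_node = this_node;
        for (node = 0; node < NR_NODES; node++)
                if (node_load[node] > node_load[max_node])
                        max_node = node;
        if (node_load[max_node] <= node_load[this_node] + NR_CPUS / NR_NODES)
                return -1;              /* not worth a remote steal */
        return busiest_in_node(max_node, self);
}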

On a 16 CPU NEC Azusa these two patches give roughly a factor of 4
improvement in Rusty's "hackbench" test (which he now calls
schedbench).

Patch descriptions are as in my previous post:
01-numa_sched_core-2.5.44-10a.patch :
       Provides basic NUMA functionality. It implements CPU pools
       with all the mess needed to initialize them. It also has a
       node aware find_busiest_queue() which first scans its own
       node for more loaded CPUs. If no steal candidate is found
       there, it finds the most loaded node and tries to steal a
       task from it. By imposing delays on remote-node steals it
       tries to equalize node loads (this is the decision order
       sketched above). These delays can be extended to cope with
       multi-level node hierarchies (that patch is not included).
02-numa_sched_ilb-2.5.44-10.patch :
       This patch provides simple initial load balancing during exec().
       It is node aware and selects the least loaded node. It also
       does a round-robin initial node selection to spread the load
       better across the nodes; a sketch follows below.
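
For illustration, the exec()-time node selection with the round-robin
starting point boils down to something like this (reusing NR_NODES from
the sketch above; identifiers are simplified, and the real cursor would
live in per-CPU data):

static int node_load[NR_NODES];         /* summed runqueue lengths */
static int rr_node;                     /* round-robin cursor */

static int sched_best_node(void)
{
        int i, node, best;

        /* rotate the start so equally loaded nodes take turns */
        best = rr_node = (rr_node + 1) % NR_NODES;

        for (i = 1; i < NR_NODES; i++) {
                node = (rr_node + i) % NR_NODES;
                if (node_load[node] < node_load[best])
                        best = node;
        }
        return best;    /* the exec()ing task is migrated there */
}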

The patches should run on ia32 NUMAQ and ia64 Azusa & TX7. Other
architectures just need a build_node() call similar to the one in
arch/i386/kernel/smpboot.c. Be careful to REALLY initialize
cache_decay_ticks (it shouldn't be zero on an SMP machine, anyway).
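
Roughly, a port's boot code needs to do the equivalent of this (the
build_node() signature and the hook name are simplified here, and the
cache_decay_ticks value is only an example of a sane non-zero setting):

extern unsigned long cache_decay_ticks;         /* in jiffies */
extern void build_node(int cpu, int node);      /* simplified */

static void numa_pool_setup(int ncpus, int hz)  /* illustrative hook */
{
        int cpu;

        /* never leave this zero on SMP: ~1ms worth of ticks here */
        cache_decay_ticks = hz / 1000 ? hz / 1000 : 1;

        for (cpu = 0; cpu < ncpus; cpu++)
                build_node(cpu, cpu_to_node(cpu));
}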

The first patch provides important infrastructure for any following
NUMA scheduler patches. It introduces CPU pools and a way to loop over
the CPUs of a single pool. The pool data initialization is somewhat
messy and sensitive. I'm trying to rewrite it to use RCU; the problem
is that we have to initialize the pool data to something reasonable
before we know how many CPUs will come up and before the cpu_to_node()
macro delivers meaningful numbers. Later on the pool data must be
reinitialized (and could be changed by CPU hotplug) in a way that goes
unnoticed by the load balancer...
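
In simplified form, the pool bookkeeping is a flat CPU array with
per-node start offsets plus a macro to walk one pool; the boot-time
default below parks everything in pool 0 until cpu_to_node() can be
trusted (names and details differ in the actual patch):

static int pool_cpus[NR_CPUS];          /* CPUs grouped by node */
static int pool_ptr[NR_NODES + 1];      /* pool i: pool_ptr[i] .. pool_ptr[i+1] */

#define for_each_pool_cpu(i, pool) \
        for ((i) = pool_ptr[pool]; (i) < pool_ptr[(pool) + 1]; (i)++)

static void pools_init_boot(int ncpus)
{
        int i;

        for (i = 0; i < ncpus; i++)
                pool_cpus[i] = i;

        pool_ptr[0] = 0;
        for (i = 1; i <= NR_NODES; i++)
                pool_ptr[i] = ncpus;    /* one big pool for now */
}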

Regards,
Erich


