Re: Node affine NUMA scheduler

From: Jesse Barnes (jbarnes@sgi.com)
Date: Wed Mar 13 2002 - 21:58:18 EST

Next message: Jean Tourrilhes: "[PATCH 2.4.19-pre3] New wireless driver API part 1"
Previous message: Jeff Garzik: "Re: libc/1427: gprof does not profile threads <synopsis of the problem"
In reply to: Erich Focht: "Node affine NUMA scheduler"
Next in thread: Erich Focht: "Re: Node affine NUMA scheduler"
Reply: Erich Focht: "Re: Node affine NUMA scheduler"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

I ran hackbench on a 16 way system with the new version of your patch,
here are the results:

# vanilla ingo ingo+erich_numa
-- ------- ---- ---------------
1 0.476 0.224 0.320
10 6.503 1.467 1.591
20 14.497 2.811 2.776
30 23.806 4.000 4.016
40 34.472 5.036 5.251
50 48.402 6.552 6.494
60 59.656 7.861 8.032
70 75.220 9.762 9.564
80 87.675 10.791 10.669
90 106.264 15.429 13.697
100 123.883 17.212 15.814

I only did one run for each kernel. Do these numbers compare to what
you're getting? The macros for SN boxes are as follows (if I got them
correct), maybe you could include them in your next patch?

/*
* SGI SN1 specific macros
*/
#elif defined(CONFIG_IA64_SGI_SN1)
#define NR_NODES 4
#define CPU_TO_NODE(cpu) ((cpu_physical_id(cpu) >> 8) & 0xff)
#define SAPICID_TO_NODE(hwid) ((hwid >> 8) & 0xff)
/*
* SGI SN2 specific macros
*/
#elif defined(CONFIG_IA64_SGI_SN2)
#define NR_NODES 2
#define CPU_TO_NODE(cpu) ((cpu_physical_id(cpu) >> 12) & 0xff)
#define SAPICID_TO_NODE(hwid) ((hwid >> 12) & 0xff)

Thanks,
Jesse

On Wed, Mar 13, 2002 at 09:10:07PM +0100, Erich Focht wrote:
> Hi,
>
> here is an update of the NUMA scheduler I'm playing with. It is built on
> top of Ingo's O(1) scheduler, the patch is for the IA-64 port of the
> latest 2.4.17 version of the O(1) scheduler which I posted to the LSE
> mailing list. Once again, please don't regard this as a finished solution,
> it is more a discussion basis.
>
> Here is a description of the background and the implementation.
>
> This scheduler extension is targeted to cc-NUMA machines with CPUs grouped
> in multiple nodes, each node having its own memory. Accessing the memory
> of a remote node implies taking penalties in memory access latency and
> bandwidth. Therefore it is desirable to keep processes on or near the node
> on which their memory (or most of it) is allocated. Fixing the
> cpus_allowed mask of tasks to a particular nodes would of course be a
> solution but experiments show that for loads >100% this leads to poorer
> performance (due to bad load balancing). In this approach I assign each
> task a homenode on which its memory will be allocated (different patch
> needed) and the scheduler will try to keep the task on its homenode or
> attract it to it.
>
> - CPUs are grouped in pools or nodes, the topology is described in the
> macro CPU_TO_NODE and architecture/platform dependent.
>
> - Each task_structure has an additional variable called node.
>
> - The homenode of a task (and its initial CPU) is chosen in do_execve(). I
> switched to this kind of initial balancing instead of do_fork() after some
> discussions. For some cases (multithreaded tasks) this is not always
> optimal and will require some additional treatment.
>
> - The load_balance routine is changed as follows:
> - Tries to balance the load inside the own node when invoked
> (i.e. each tick on idle cpus, every 250ms on busy cpus). This is the
> same behavior as Ingo's design, but restricted to the nearest CPUs
> (the node).
>
> - If the local node is balanced, finds the most loaded node and its
> most loaded CPU.
> - If the local node's load is below the machines average load,
> tries to steal a task immediately from the most loaded CPU.
> - If the local node load is (within some margins) same as the
> machine's average load, remebers the most loaded node and waits
> 100ms before trying to steal a task. As all idle CPUs are racing
> for getting a task from the most loaded node, this gives us a
> chance to get better balance between nodes.
>
> - Tasks not running on their homenode are treated preferrentially by
> the CPUs belonging to the homenode. The load seen by a CPU is
> "subjective" because the tasks originating from its node are counted
> twice. This is basically the force attracting the tasks back to their
> homenode.
>
> - I kept Ingo's timings and margins for balancing:
> - 1ms or 1 tick for idle cpus, 250ms for busy cpus,
> - >= 25% load imbalance necessary.
>
> The patch is tested on a 16 CPU Itanium server (NEC AzusA) and I've seen
> gains of 10-80% in performance, depending on the load. I'd be curious to
> hear whether anybody else is working on node affinity and/or has
> ideas/comments...
>
> Best regards,
> Erich
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Jean Tourrilhes: "[PATCH 2.4.19-pre3] New wireless driver API part 1"
Previous message: Jeff Garzik: "Re: libc/1427: gprof does not profile threads <synopsis of the problem"
In reply to: Erich Focht: "Node affine NUMA scheduler"
Next in thread: Erich Focht: "Re: Node affine NUMA scheduler"
Reply: Erich Focht: "Re: Node affine NUMA scheduler"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Fri Mar 15 2002 - 22:00:16 EST