Re: [Lse-tech] [PATCH 1/2] node affine NUMA scheduler

From: Erich Focht
Date: Sat Sep 21 2002 - 12:32:59 EST

Hi Martin,

thanks for the comments and the testing!

On Saturday 21 September 2002 19:11, Martin J. Bligh wrote:
> > From the below, I'd suggest you're getting pages off the wrong
> > nodes: do_anonymous_page is page zeroing, and rmqueue the buddy
> > allocator. Are you sure the current->node thing is getting set
> > correctly? I'll try backing out your alloc_pages tweaking, and
> > see what happens.

The current->node is most probably wrong for most of the kernel threads,
except for migration_thread and ksoftirqd. But it should be fine for
user processes.

Might also be that the __node_distance matrix which you might use
by default is not optimal for NUMAQ. It is fine for our remote/local
latency ratio of 1.6. Yours is maybe an order of magnitude larger?
Try replacing: 15 -> 50, guess you don't go beyond 4 nodes now...
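
To make the 15 -> 50 suggestion a bit more concrete, here is a minimal
sketch of how a larger remote entry in a distance table biases a decision
towards the local node. Everything in it is an assumption for illustration:
the table layout, the local baseline of 10, and the prefer_home_node() rule
are hypothetical and are not taken from the patch.

```c
#include <assert.h>

#define MAX_NODES      4   /* assumption: at most 4 nodes */
#define LOCAL_DISTANCE 10  /* assumption: distance of a node to itself */

/* Hypothetical stand-in for a node distance table; not the patch's layout. */
static int node_distance[MAX_NODES][MAX_NODES];

/* Put LOCAL_DISTANCE on the diagonal and `remote` everywhere else. */
static void fill_node_distance(int remote)
{
    for (int i = 0; i < MAX_NODES; i++)
        for (int j = 0; j < MAX_NODES; j++)
            node_distance[i][j] = (i == j) ? LOCAL_DISTANCE : remote;
}

/*
 * Illustrative placement rule (an assumption, not taken from the patch):
 * weight each node's load by its distance from `home` and return 1 if
 * staying on `home` is no more expensive than moving to `other`.
 */
static int prefer_home_node(int home, int other, int home_load, int other_load)
{
    assert(home >= 0 && home < MAX_NODES);
    assert(other >= 0 && other < MAX_NODES);
    return home_load * node_distance[home][home]
        <= other_load * node_distance[home][other];
}
```

With a remote entry of 15, a home node carrying 3 tasks loses against a
remote node carrying 1 task; with 50 the same situation keeps the work at
home, which is the direction a machine with a large remote/local latency
ratio would want.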

> OK, well removing that part of the patch gets us back from 28s to
> about 21s (compared to 20s virgin), total user time compared to
> virgin is up from 59s to 62s, user from 191 to 195. So it's still
> a net loss, but not by nearly as much. Are you determining target
> node on fork or exec ? I forget ...

The default is exec(). You can use
to set the node_policy to do initial load_balancing in fork().
Just do "nodpol -P 2" in the shell before starting the job/task.
This is VERY recommended if you are creating many tasks/threads.
The default behavior is fine for MPI jobs or users starting serial

> Profile is more comparable. Nothing sticks out any more, but maybe
> it just needs some tuning for balance intervals or something.

Hmmm... There are two changes which might lead to lower performance:
1. load_balance() is not inlined any more.
2. pull_task steals only one task per load_balance() call. Before, it
stole at most imbalance/2 tasks (if I remember correctly).
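
The effect of point 2 can be sketched with a toy model that counts how many
load_balance() calls are needed to even out two run queues. This is only an
illustration under stated assumptions: "imbalance" is taken here to be the
difference in queue lengths, and calls_until_balanced() is a hypothetical
helper, not code from the scheduler.

```c
#include <assert.h>

/*
 * Count how many load_balance() calls it takes until two run queues differ
 * by less than two tasks. Hypothetical model: `imbalance` is taken to be
 * the difference in queue lengths.
 */
static int calls_until_balanced(int busy, int idle, int steal_one_only)
{
    int calls = 0;

    assert(busy >= idle && idle >= 0);
    while (busy - idle >= 2) {
        int imbalance = busy - idle;
        /* Either a single task per call, or up to half the imbalance. */
        int moved = steal_one_only ? 1 : imbalance / 2;

        busy -= moved;
        idle += moved;
        calls++;
    }
    return calls;
}
```

In this model, queues of 9 and 1 tasks need four calls when only one task
is stolen per call, but a single call when imbalance/2 tasks are moved at
once.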

And of course, there is some real additional overhead due to the
initial load balancing, which one feels for short-lived tasks... So
please try "nodpol -P 2" (and reset to default by "nodpol -P 0").

Did you try the first patch alone? I mean the pooling-only scheduler?


> 153385 total 0.1544
> 91219 default_idle
> 7475 do_anonymous_page
> 4564 page_remove_rmap
> 4167 handle_mm_fault
> 3467 .text.lock.namei
> 2520 page_add_rmap
> 2112 rmqueue
> 1905 .text.lock.dec_and_lock
> 1849 zap_pte_range
> 1668 vm_enough_memory
> 1612 __free_pages_ok
> 1504 file_read_actor
> 1484 find_get_page
> 1381 __generic_copy_from_user
> 1207 do_no_page
> 1066 schedule
> 1050 get_empty_filp
> 1034 link_path_walk

Dr. Erich Focht
NEC European Supercomputer Systems, European HPC Technology Center
Hessbruehlstr. 21B, 70565 Stuttgart, Germany
phone: +49-711-78055-15                    fax  : +49-711-78055-25

