Re: NUMA scheduler 2nd approach

From: Christoph Hellwig (hch@infradead.org)
Date: Mon Jan 13 2003 - 03:02:08 EST

Next message: Petr.Titera@whitesoft.cz: "Re: Problems with USB"
Previous message: Jan Yenya Kasprzak: "Re: 2.5.56: Two global symbols "io""
In reply to: Erich Focht: "NUMA scheduler 2nd approach"
Next in thread: Erich Focht: "Re: NUMA scheduler 2nd approach"
Reply: Erich Focht: "Re: NUMA scheduler 2nd approach"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Jan 13, 2003 at 12:55:28AM +0100, Erich Focht wrote:
> Hi Martin & Michael,
>
> as discussed on the LSE call I played around with a cross-node
> balancer approach put on top of the miniature NUMA scheduler. The
> patches are attached and it seems to be clear that we can regain the
> good performance for hackbench by adding a cross-node balancer.

The changes look fine to me. But I think there's some conding style
issues that need cleaning up (see below). Also is there a reason
patches 2/3 and 4/5 aren't merged into one patch each?

/*
- * find_busiest_queue - find the busiest runqueue.
+ * find_busiest_in_mask - find the busiest runqueue among the cpus in cpumask
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
+static inline runqueue_t *find_busiest_in_mask(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, unsigned long cpumask)

        find_busiest_queue has just one caller in 2.5.56, I'd suggest just
        changing the prototype and updating that single caller to pass in
        the cpumask opencoded.

@@ -160,7 +160,6 @@ extern void update_one_process(struct ta
extern void scheduler_tick(int user_tick, int system);
extern unsigned long cache_decay_ticks;

-
#define MAX_SCHEDULE_TIMEOUT LONG_MAX
extern signed long FASTCALL(schedule_timeout(signed long timeout));
asmlinkage void schedule(void);

I don't think you need this spurious whitespace change :)

+#ifdef CONFIG_NUMA
+extern void sched_balance_exec(void);
+extern void node_nr_running_init(void);
+#define nr_running_inc(rq) atomic_inc(rq->node_ptr); \
+ rq->nr_running++
+#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
+ rq->nr_running--

        static inline void nr_running_inc(runqueue_t *rq)
        {
                atomic_inc(rq->node_ptr);
                rq->nr_running++
        }

etc.. would look a bit nicer.

diff -urNp linux-2.5.55-ms/kernel/sched.c linux-2.5.55-ms-ilb/kernel/sched.c
--- linux-2.5.55-ms/kernel/sched.c 2003-01-10 23:01:02.000000000 +0100
+++ linux-2.5.55-ms-ilb/kernel/sched.c 2003-01-11 01:12:43.000000000 +0100
@@ -153,6 +153,7 @@ struct runqueue {
         task_t *curr, *idle;
         prio_array_t *active, *expired, arrays[2];
         int prev_nr_running[NR_CPUS];
+ atomic_t * node_ptr;

atomic_t *node_ptr; would match the style above.

+#if CONFIG_NUMA
+static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp = {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};

Maybe wants some linewrapping after 80 chars?

+__init void node_nr_running_init(void)
+{
+ int i;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ cpu_rq(i)->node_ptr = &node_nr_running[__cpu_to_node(i)];
+ }
+ return;
+}

        The braces and the return are superflous. Also kernel/sched.c (or
        mingo codein general) seems to prefer array + i instead of &array[i]
        (not that I have a general preferences, but you should try to match
        the surrounding code)

+static void sched_migrate_task(task_t *p, int dest_cpu)
+{
+ unsigned long old_mask;
+
+ old_mask = p->cpus_allowed;
+ if (!(old_mask & (1UL << dest_cpu)))
+ return;
+ /* force the process onto the specified CPU */
+ set_cpus_allowed(p, 1UL << dest_cpu);
+
+ /* restore the cpus allowed mask */
+ set_cpus_allowed(p, old_mask);

This double set_cpus_allowed doesn't look nice to me. I don't
have a better suggestion of-hand, though :(

+#define decl_numa_int(ctr) int ctr

        This is ugly as hell. I'd prefer wasting one int in each runqueue
        or even an ifdef in the struct declaration over this obsfucation
        all the time.

@@ -816,6 +834,16 @@ out:
static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance)
{
unsigned long cpumask = __node_to_cpu_mask(__cpu_to_node(this_cpu));
+#if CONFIG_NUMA
+ int node;
+# define INTERNODE_LB 10

This wants to be put to the other magic constants in the scheduler
and needs an explanation there.

#define nr_running_dec(rq) atomic_dec(rq->node_ptr); \
rq->nr_running--
#define decl_numa_int(ctr) int ctr
+#define decl_numa_nodeint(v) int v[MAX_NUMNODES]

Another one of those.. You should reall just stick the CONFIG_NUMA
ifdef into the actual struct declaration.

+/*
+ * Find the busiest node. All previous node loads contribute with a geometrically
+ * deccaying weight to the load measure:

Linewrapping again..

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Petr.Titera@whitesoft.cz: "Re: Problems with USB"
Previous message: Jan Yenya Kasprzak: "Re: 2.5.56: Two global symbols "io""
In reply to: Erich Focht: "NUMA scheduler 2nd approach"
Next in thread: Erich Focht: "Re: NUMA scheduler 2nd approach"
Reply: Erich Focht: "Re: NUMA scheduler 2nd approach"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Wed Jan 15 2003 - 22:00:44 EST