Re: [CFT][RFC] HT scheduler

From: Rusty Russell
Date: Fri Dec 12 2003 - 21:21:37 EST


In message <3FD9679A.1020404@xxxxxxxxxxxxxxx> you write:
>
> Thanks for having a look Rusty. I'll try to convince you :)
>
> As you know, the domain classes are not just for HT, but can do multiple
> levels of NUMA, and they can be built by architecture-specific code, which
> is good for Opteron, for example. They don't need CONFIG_SCHED_SMT either,
> of course, or even CONFIG_NUMA: degenerate domains can just be collapsed
> (the code isn't there to do that now).

Yes, but this isn't what we really want. I'm actually accusing you of
lacking ambition 8)

> Shared runqueues I find aren't so flexible. I think they describe the P4 HT
> architecture perfectly, but what happens if (when) siblings get separate
> L1 caches? What about the SMT, CMP, SMP and NUMA levels in the POWER5?

It describes every HyperThread implementation I am aware of today, so
it suits us fine for the moment. Runqueues may still be worth sharing
even if the L1 cache isn't, for example.
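
To make the shared-runqueue idea concrete: once both siblings resolve
to the same runqueue, the question of which sibling a task sits on
simply goes away. A toy sketch, not Ingo's actual code, with made-up
names:

	#include <stdio.h>

	#define NR_CPUS           4	/* two packages, two HT siblings each */
	#define SIBLINGS_PER_CORE 2

	struct runqueue {
		int nr_running;		/* tasks queued on this (shared) runqueue */
	};

	static struct runqueue runqueues[NR_CPUS / SIBLINGS_PER_CORE];

	/* Both HT siblings of a package resolve to the same runqueue. */
	static struct runqueue *cpu_rq(int cpu)
	{
		return &runqueues[cpu / SIBLINGS_PER_CORE];
	}

	int main(void)
	{
		int cpu;

		for (cpu = 0; cpu < NR_CPUS; cpu++)
			printf("cpu %d -> runqueue %d\n", cpu,
			       (int)(cpu_rq(cpu) - runqueues));
		return 0;
	}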

> The large SGI systems (and I imagine IBM's POWER5s) need things like
> progressive balancing backoff and would probably benefit from a more
> hierarchical balancing scheme, so all the balancing operations don't kill
> the system.

But this is my point. Scheduling is one part of the problem. I want
to be able to have the arch-specific code feed in a description of
memory and CPU distances, bandwidths and whatever else, and have the
scheduler, slab allocator, per-cpu data allocation, page cache, page
migrator and anything else that cares adjust themselves based on that.
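
Just to sketch the shape I have in mind (every name and field below is
invented for illustration, not taken from any existing patch):

	/*
	 * Hypothetical: one arch-provided description that the scheduler,
	 * slab allocator, per-cpu allocation and friends could all consult.
	 */
	struct topo_level {
		const char	*name;		/* "sibling", "die", "board", ... */
		unsigned int	mem_distance;	/* relative memory access cost */
		unsigned int	balance_interval; /* hint: how often to balance here */
		struct topo_level *parent;	/* next level up, NULL at the top */
	};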

Power 4 today has pairs of CPUs on a die, four dies on a board, and
four boards in a machine. I want one infrastructure to describe it,
not have to program every subsystem from arch-specific code.
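
With something like the hypothetical struct topo_level above, that
layout is just three levels chained together (values illustrative
only):

	static struct topo_level power4_machine = {
		.name = "machine", .mem_distance = 40, .parent = NULL,
	};
	static struct topo_level power4_board = {
		.name = "board", .mem_distance = 20, .parent = &power4_machine,
	};
	static struct topo_level power4_die = {
		.name = "die", .mem_distance = 10, .parent = &power4_board,
	};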

> w26 does ALL this, while sched.o is 3K smaller than Ingo's shared runqueue
> patch on NUMA and SMP, and 1K smaller on UP (although sched.c is 90 lines
> longer). kernbench system time is down nearly 10% on the NUMAQ, so it isn't
> hurting performance either.

Agreed, but Ingo's shared runqueue patch is a poor implementation of a
good idea: I've always disliked it. I'm halfway through updating my
patch, and I really think you'll like it better. It's not
incompatible with NUMA changes; in fact, it's fairly non-invasive.

> And finally, Linus also wanted the balancing code to be generalised to
> handle SMT, and Ingo said he liked my patch from a first look.

Oh, I like your patch too (except those #defines should really be an
enum). I just think we can do better with less.
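
Purely as an illustration (the flag names below are invented, not the
ones in the actual patch), something like:

	/*
	 * Instead of a pile of unrelated #defines:
	 *
	 *	#define SD_FLAG_NEWIDLE	1
	 *	#define SD_FLAG_EXEC	2
	 *	#define SD_FLAG_WAKE	4
	 *
	 * an enum groups the values under one name the compiler knows about:
	 */
	enum sd_flag {
		SD_FLAG_NEWIDLE	= 1,
		SD_FLAG_EXEC	= 2,
		SD_FLAG_WAKE	= 4,
	};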

Cheers,
Rusty.
--
Anyone who quotes me in their sig is an idiot. -- Rusty Russell.