Re: 2.6.4-mm1

From: Andi Kleen
Date: Fri Mar 12 2004 - 09:17:48 EST


On Fri, Mar 12, 2004 at 03:24:43PM +1100, Nick Piggin wrote:
>
>
> Andi Kleen wrote:
>
> >On Thu, Mar 11, 2004 at 07:04:50PM -0800, Nakajima, Jun wrote:
> >
> >>As we can have more complex architectures in the future, the scheduler
> >>is flexible enough to represent various scheduling domains effectively,
> >>and yet keeps the common scheduler code simple.
> >>
> >
> >I think for SMT alone it's too complex and for NUMA it doesn't do
> >the right thing for "modern NUMAs" (where NUMA factor is very low
> >and you have a small number of CPUs for each node).
> >
> >
>
> For SMT it is less complex than shared runqueues; it is actually
> fewer lines of code and a smaller object size.

By moving all the complexity into arch/* ?

>
> It is also more flexible than shared runqueues in that you can still
> have control over each sibling's runqueue. Con's SMT nice patch for
> example would probably be more difficult to do with shared runqueues.
> Shared runqueues also give zero affinity to siblings. While current
> implementations may not (do they?) care, future ones might.
>
> For Opteron-type NUMA, it actually balances much more aggressively
> than the default NUMA scheduler, especially when a CPU is idle. I
> don't doubt that you're seeing poor performance, but it should be
> fixable.
>
> The problem is presumably just your lack of time to investigate
> further, and my lack of problem descriptions or Opterons.

I didn't investigate your scheduler further because I have my
doubts about it being the right approach, and it seems to have
some obvious design bugs (like the racy SMT setup).

The problem description is still the same as before.

Basically it is: schedule as on SMP, but avoid local affinity for newly
created tasks and balance early. Allow all the old-style NUMA
heuristics to be disabled.
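
To illustrate what I mean by "avoid local affinity and balance early":
at fork/exec time the new task should not simply inherit the parent's
CPU, but go to the idlest CPU system-wide. This is only a toy userspace
sketch, not kernel code; the topology numbers and the find_idlest_cpu()
helper are made up for the example.

#include <stdio.h>

#define NR_CPUS   4          /* made-up topology: 4 CPUs ... */
#define NR_NODES  2          /* ... spread over 2 nodes      */

/* hypothetical per-CPU runqueue lengths, e.g. sampled from load stats */
static int rq_len[NR_CPUS] = { 3, 1, 0, 2 };

/*
 * Placement at fork/exec: ignore the parent's CPU and pick the
 * idlest CPU system-wide, so freshly created tasks spread across
 * nodes early instead of inheriting local affinity.
 */
static int find_idlest_cpu(void)
{
        int cpu, best = 0;

        for (cpu = 1; cpu < NR_CPUS; cpu++)
                if (rq_len[cpu] < rq_len[best])
                        best = cpu;
        return best;
}

int main(void)
{
        printf("place new task on cpu %d\n", find_idlest_cpu());
        return 0;
}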

Longer term some homenode scheduling affinity may still be useful,
but I tried to get that to work on 2.4 and failed, so I'm not sure
it can be done. The right way may be to keep track of how much memory
each thread has allocated on each node and preferably schedule it on
the node with the most memory. But that's future work.
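
The node selection part of that idea is trivial; the hard parts are
collecting the per-node allocation counters and deciding when moving a
task is worth it. Again only a toy sketch, with an invented per-task
counter structure, not kernel code:

#include <stdio.h>

#define NR_NODES 2

/* hypothetical per-task counter: pages the task has allocated per node */
struct task_mem {
        unsigned long node_pages[NR_NODES];
};

/*
 * Homenode = the node holding most of the task's memory; preferring it
 * keeps the task close to its data when the balancer has a choice.
 */
static int preferred_node(const struct task_mem *t)
{
        int node, best = 0;

        for (node = 1; node < NR_NODES; node++)
                if (t->node_pages[node] > t->node_pages[best])
                        best = node;
        return best;
}

int main(void)
{
        struct task_mem t = { .node_pages = { 1200, 40960 } };

        printf("prefer node %d\n", preferred_node(&t));
        return 0;
}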

>
> One thing you definitely want is a sched_balance_fork, is that right?
> Have you been able to do any benchmarks on recent -mm kernels?

I sent the last benchmarks I did to you (including the tweaks you
suggested). All did worse than the standard scheduler. Did you
change anything significant that makes rebenchmarking useful?

-Andi