Re: [PATCH 2.5.58] new NUMA scheduler: fix

From: Ingo Molnar (mingo@elte.hu)
Date: Thu Jan 16 2003 - 15:19:22 EST


On Thu, 16 Jan 2003, Martin J. Bligh wrote:

> > complex. It's the one that is aware of the global scheduling picture. For
> > NUMA i'd suggest two asynchronous frequencies: one intra-node frequency,
> > and an inter-node frequency - configured by the architecture and roughly
> > in the same proportion to each other as cachemiss latencies.
>
> That's exactly what's in the latest set of patches - admittedly it's a
> multiplier of when we run load_balance, not the tick multiplier, but
> that's very easy to fix. Can you check out the stuff I posted last
> night? I think it's somewhat cleaner ...

yes, i saw it, it has the same tying between idle-CPU-rebalance and
inter-node rebalance, as Erich's patch. You've put it into
cpus_to_balance(), but that still makes rq->nr_balanced a 'synchronously'
coupled balancing act. There are two synchronous balancing acts currently:
the 'CPU just got idle' event, and the exec()-balancing (*) event. Neither
must involve any 'heavy' balancing, only local balancing. The inter-node
balancing (which is heavier than even the global SMP balancer), should
never be triggered from the high-frequency path. [whether it's high
frequency or not depends on the actual workload, but it can be potentially
_very_ high frequency, easily on the order of 1 million times a second -
then you'll call the inter-node balancer 100K times a second.]

I'd strongly suggest decoupling the heavy NUMA load-balancing code from
the fastpath and re-checking the benchmark numbers.
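
To illustrate the decoupling suggested above, here is a minimal user-space
sketch, not code from any of the posted patches. All names (rq_sketch,
on_cpu_idle, on_timer_tick, the two *_TICKS constants and their values) are
hypothetical and exist only for this example. It assumes the idle event does
local balancing only, while the timer tick drives two independent intervals,
one intra-node and one inter-node.

```c
#include <assert.h>

/* Hypothetical intervals, in timer ticks; a real port would have the
 * architecture set these, roughly in proportion to cache-miss latencies. */
#define INTRA_NODE_TICKS   2UL
#define INTER_NODE_TICKS  20UL

struct rq_sketch {
    unsigned long last_intra;       /* tick of last timer-driven local balance */
    unsigned long last_inter;       /* tick of last cross-node balance */
    unsigned long local_calls;      /* counters, for illustration only */
    unsigned long cross_node_calls;
};

/* Stand-ins for the real balancers: they only count invocations. */
static void balance_local(struct rq_sketch *rq)      { rq->local_calls++; }
static void balance_cross_node(struct rq_sketch *rq) { rq->cross_node_calls++; }

/* Synchronous event ("CPU just got idle"): local balancing only. */
static void on_cpu_idle(struct rq_sketch *rq)
{
    assert(rq);
    balance_local(rq);
}

/* Timer tick: the only place the cross-node balancer is reachable from. */
static void on_timer_tick(struct rq_sketch *rq, unsigned long now)
{
    assert(rq);
    if (now - rq->last_intra >= INTRA_NODE_TICKS) {
        rq->last_intra = now;
        balance_local(rq);
    }
    if (now - rq->last_inter >= INTER_NODE_TICKS) {
        rq->last_inter = now;
        balance_cross_node(rq);
    }
}
```

In this sketch the number of cross-node balancing calls is bounded by the
tick count alone, no matter how often on_cpu_idle() fires.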

        Ingo

(*) whether sched_balance_exec() is a high-frequency path or not is up for
debate. Right now it's not possible to get much more than a couple of
thousand exec()'s per second on fast CPUs. Hopefully that will change in
the future though, so exec() events could become really fast. So i'd
suggest to only do local (ie. SMP-alike) balancing in the exec() path, and
only do NUMA cross-node balancing with a fixed frequency, from the timer
tick. But exec()-time is really special, since the user task usually has
zero cached state at this point, so we _can_ do cheap cross-node balancing
as well. So it's a boundary thing - probably doing the full-blown
balancing is the right thing.
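
The two exec()-time options weighed in this footnote can be sketched as
follows. This is an illustrative user-space example only, not the actual
sched_balance_exec() implementation; pick_exec_cpu(), cpu_to_node_sketch(),
the per-CPU load array and the 8-CPU, 2-node topology are all assumptions
made up for the example.

```c
#include <assert.h>

/* Hypothetical topology: 8 CPUs, 4 per node. */
#define NR_CPUS_SKETCH 8
#define CPUS_PER_NODE  4

static int cpu_to_node_sketch(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

/*
 * Pick the least-loaded CPU for a task calling exec() on this_cpu.
 * With cross_node == 0 only CPUs on this_cpu's node are considered
 * (the "local, SMP-alike" option); with cross_node != 0 every CPU is a
 * candidate (the "full-blown" option). On ties the task stays where it is
 * or goes to the lowest-numbered candidate found first.
 */
static int pick_exec_cpu(const int load[NR_CPUS_SKETCH], int this_cpu,
                         int cross_node)
{
    int best = this_cpu;

    assert(this_cpu >= 0 && this_cpu < NR_CPUS_SKETCH);
    for (int cpu = 0; cpu < NR_CPUS_SKETCH; cpu++) {
        if (!cross_node &&
            cpu_to_node_sketch(cpu) != cpu_to_node_sketch(this_cpu))
            continue;
        if (load[cpu] < load[best])
            best = cpu;
    }
    return best;
}
```

Because the task has almost no cached state at exec() time, the only extra
cost of the cross-node variant in this sketch is the wider scan itself.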
