Re: top-down balance purpose discussion -- resend

From: Alex Shi
Date: Wed Jan 22 2014 - 02:41:12 EST


On 01/21/2014 10:57 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 10:04:26PM +0800, Alex Shi wrote:
>>
>> Current scheduler load balance is bottom-up mode, each CPU need
>> initiate the balance by self.
>>
>> 1, Like in a integrate computer system, it has smt/core/cpu/numa, 4
>> level scheduler domains. If there is just 2 tasks in whole system that
>> both running on cpu0. Current load balance need to pull task to another
>> smt in smt domain, then pull task to another core, then pull task to
>> another cpu, finally pull task to another numa. Totally it is need 4
>> times task moving to get system balance.
>
> Except the idle load balancer, and esp. the newidle can totally by-pass
> this.
>
> If you do the packing right in the newidle pass, you'd get there in 1
> step.

It give me a huge pressure to argue with you a great experts. I am
waiting and very appreciate for any comments and corrections. :)

Yes, a newidle will kindly relief this. but it can not eliminate it. If
a newidle happens on another numa group. It just needs 1 step. But if it
happens on another smt group, it still needs 4 steps. So generally, we
still need one more steps before well balance.

In this example, if a newidle is in the same smallest group, maybe we
should wakeup a remotest cpu in system/llc to avoid extra task moving in
near future for best performance.
And for power saving, maybe we'd better kick the task to smallest group,
then let the remote cpu group idle.
But for current newidle, it's impossible to do this because newidle is
also bottom-up mode.
>
>> Generally, the task moving complexity is
>> O(nm log n), n := nr_cpus, m := nr_tasks
>>
>> There is a excellent summary and explanation for this in
>> kernel/sched/fair.c:4605
>
> Which is a perfectly fine scheme for a busy system.
>
>> Another weakness of current LB is that every cpu need to get the other
>> cpus' load info repeatedly and try to figure out busiest sched
>> group/queue on every sched domain level. But it just waste time, since
>> it may not conduct a task moving. One of reasons is that cpu can only
>> pull task, not pushing.
>
> This doesn't make sense.. and in fact, we do a limited amount of 3rd
> party movements.

Yes, but the 3rd party movements is too limited, just for task pinned.
>
> Whatever you do, you have to repeat the information gathering anyhow,
> because it constantly changes.
>

Yes, it is good to collection the load info once for once balance. but
if the balance cpu is busiest cpu, current balance still keep collecting
every group load info from bottom to up, and then do nothing on this
imbalance system. This is bad.

> Trying to serialize that doesn't make any kind of sense. The only thing
> you want is that the system converges.

Sorry, would you like to give a bit more details of 'serialize' is no sense?
>
> Skipped the rest because it seems build on a fundament I don't agree
> with. That 4 move thing is just silly for an idle system, and we
> shouldn't do that.
>
> I also very much do not want a single CPU balancing the entire system,
> that's the anti-thesis of scalable.

Sorry. IMHO, single cpu is possible to handle 1000 cpu balancing. And it
is far more scalable than every cpu do balance in system, since there is
only one cpu need to pick other cpu load info.

BTW, there is no organize among all cpus' balancing currently. That's a
a bit mess. Like if 2 cpus in a small cpu group just do balance for
whole system at the same time, then both of them think self group is
light and want more load. then they have the chance to over pull load to
self group. That is bad. And single balancing has no such problem.

--
Thanks
Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/