Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()

From: Dietmar Eggemann
Date: Fri Jul 28 2017 - 08:16:37 EST


On 26/07/17 15:54, Peter Zijlstra wrote:
> On Tue, Jul 18, 2017 at 08:48:53PM +0100, Dietmar Eggemann wrote:
>> Hi Jeffrey,
>>
>> On 13/07/17 20:55, Jeffrey Hugo wrote:

[...]

>>> Since the group imbalance path in calculate_imbalance() is at best a NOP
>>> but otherwise harmful, remove it.
>
> Hurm.. so fix_small_imbalance() itself is a pile of dog poo... it used
> to make sense a long time ago, but smp-nice and then cgroups made a
> complete joke of things.
>
>> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
>> level sched groups.
>
> That'd be a NUMA box?

I don't think it's NUMA. The SD levels are MC and DIE, w/ # DIE sg's >> 2.

[...]

>> but here the prefer_sibling handling (group overloaded) eclipses 'group
>> imbalance' the moment one of the cfs tasks can go to cpu2 so the if
>> condition you got rid of is a nop.
>>
>> I wonder if it is fair to say that your fix helps multi-cluster
>> (especially with n > 2) systems without SMT and with your first patch
>> [1] for this specific, cpu affinity restricted test cases.
>
> I tried on an IVB-EP with all the HT siblings unplugged, could not
> reproduce either. Still at n=2 though. Let me fire up an EX, that'll get
> me n=4.
>
> So this is 4 * 18 * 2 = 144 cpus:

Impressive ;-)

>
> # for ((i=72; i<144; i++)) ; do echo 0 > /sys/devices/system/cpu/cpu$i/online; done
> # taskset -pc 0,18 $$
> # while :; do :; done & while :; do :; done &
>
> So I'm taking SMT out, affine to first and second MC group, start 2
> loops.
>
> Using another console I see them both using 100%.
>
> If I then start a 3rd loop, I see 100% 50%,50%. I then kill the 100%.
> Then instantly they balance and I get 2x100% back.

Yeah, could reproduce on IVB-EP (2x10x2).

> Anything else I need to reproduce? (other than maybe a slightly less
> insane machine :-)

I guess what Jeff is trying to avoid is that 'busiest->load_per_task',
lowered to 'sds->avg_load' in case of an imbalanced busiest sg:

	if (busiest->group_type == group_imbalanced)
		busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);

is so low that later fix_small_imbalance() won't be called and
'env->imbalance' stays so low that load-balance of one 50% task to the
now idle cpu won't happen.

	if (env->imbalance < busiest->load_per_task)
		fix_small_imbalance(env, sds);

Having a lot of otherwise idle DIE sg's helps to keep
'sds->avg_load' low in comparison to 'busiest->load_per_task'.
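
To make the effect of the group count concrete, here is a rough userspace
sketch of the two checks quoted above. It is not the kernel code: struct
sg_stats and all helper names are hypothetical, the average ignores
capacity scaling (plain total load divided by the number of groups), and
the numbers in the usage note are made up for illustration.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-in for the busiest group's stats. */
struct sg_stats {
	unsigned long load_per_task;
	bool imbalanced;	/* models group_type == group_imbalanced */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/*
 * Assumption: every group has the same capacity, so the domain average
 * is just total load / number of groups. More idle groups -> lower average.
 */
static unsigned long domain_avg_load(unsigned long total_load,
				     unsigned int nr_groups)
{
	return total_load / nr_groups;
}

/*
 * Models the clamp under discussion: with 'clamp' set, an imbalanced
 * busiest group has its load_per_task lowered to at most avg_load.
 * With 'clamp' clear (the clamp removed), it is left alone.
 */
static unsigned long effective_load_per_task(const struct sg_stats *busiest,
					     unsigned long avg_load,
					     bool clamp)
{
	if (clamp && busiest->imbalanced)
		return min_ul(busiest->load_per_task, avg_load);
	return busiest->load_per_task;
}

/* Models the 'env->imbalance < busiest->load_per_task' test. */
static bool takes_small_imbalance_path(unsigned long imbalance,
					unsigned long load_per_task)
{
	return imbalance < load_per_task;
}
```

With illustrative values (total load 1024, busiest load_per_task 512,
imbalance 200): for 8 groups the average is 128, the clamp lowers
load_per_task to 128, and 200 < 128 fails, so the small-imbalance path is
skipped. For 2 groups the average is 512, the clamp changes nothing, and
the path is taken. Without the clamp the path is taken in both cases.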

> Because I have the feeling that while this patch cures things for you,
> you're fighting symptoms.

Unfortunately, I don't have a machine available with n >> 2 (on DIE or
NUMA) ...