Re: [PATCH V6] sched/fair: Remove group imbalance from calculate_imbalance()

From: Peter Zijlstra
Date: Fri Jul 28 2017 - 08:59:37 EST


On Fri, Jul 28, 2017 at 01:16:24PM +0100, Dietmar Eggemann wrote:
> >> IIRC the topology you had in mind was MC + DIE level with n (n > 2) DIE
> >> level sched groups.
> >
> > That'd be a NUMA box?
>
> I don't think it's NUMA. SD levels are MC and DIE, w/ # of DIE sg's >> 2.

Ah, I can't read. I thought >2 DIEs.

> > So this is 4 * 18 * 2 = 144 cpus:
>
> Impressive ;-)

Takes forever to boot though :/

> > If I then start a 3rd loop, I see 100%, 50%, 50%. I then kill the 100%.
> > Then instantly they balance and I get 2x100% back.
>
> Yeah, could reproduce on IVB-EP (2x10x2).

OK, I have one of those. What should I do? Because I didn't actually see
anything odd.

> > Anything else I need to reproduce? (other than maybe a slightly less
> > insane machine :-)
>
> I guess what Jeff is trying to avoid is that 'busiest->load_per_task',
> lowered to 'sds->avg_load' in case of an imbalanced busiest sg:
>
>     if (busiest->group_type == group_imbalanced)
>         busiest->load_per_task = min(busiest->load_per_task, sds->avg_load);
>
> is so low that later fix_small_imbalance() won't be called and
> 'env->imbalance' stays so low that load-balancing of one 50% task to the
> now idle cpu won't happen.
>
>     if (env->imbalance < busiest->load_per_task)
>         fix_small_imbalance(env, sds);
>
> Having a lot of otherwise idle DIE sg's helps keep 'sds->avg_load' low
> compared to 'busiest->load_per_task'.
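
For reference, those two fragments sit in calculate_imbalance() roughly as
follows; this is a condensed sketch of the kernel/sched/fair.c code from
around that time, paraphrased rather than a verbatim copy:

    static inline void
    calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
    {
        struct sg_lb_stats *busiest = &sds->busiest_stat;

        if (busiest->group_type == group_imbalanced) {
            /*
             * Affinity-constrained busiest group: cap its load_per_task
             * by the domain-wide average, which stays low when most of
             * the DIE groups are idle.
             */
            busiest->load_per_task =
                min(busiest->load_per_task, sds->avg_load);
        }

        /* ... env->imbalance derived from the avg_load deltas ... */

        /*
         * In the scenario above this test fails because load_per_task
         * was capped so low, so fix_small_imbalance() is skipped and
         * env->imbalance stays too small to pull the 50% task over to
         * the now idle cpu.
         */
        if (env->imbalance < busiest->load_per_task)
            fix_small_imbalance(env, sds);
    }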

Right, but the whole load_per_task thing is a bit wonky, and since that's
the basis of fix_small_imbalance() I'm very suspicious of it.
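
For reference, 'load_per_task' itself is just the group's weighted load
averaged over its runnable tasks; a sketch of the update_sg_lb_stats()
computation from that era (again paraphrased, not verbatim):

    if (sgs->sum_nr_running)
        sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;

A single averaged figure like this glosses over tasks of very different
weights, which is presumably what makes it a shaky basis for the
fix_small_imbalance() heuristics.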