Re: [PATCH v2 1/3] sched/fair: fix rounding issue for asym packing

From: Vincent Guittot
Date: Wed Dec 19 2018 - 03:32:16 EST


On Tue, 18 Dec 2018 at 18:32, Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
>
> On 14/12/2018 16:01, Vincent Guittot wrote:
> > When check_asym_packing() is triggered, the imbalance is set to:
> > busiest_stat.avg_load * busiest_stat.group_capacity / SCHED_CAPACITY_SCALE
> > busiest_stat.avg_load also comes from a division and the final rounding
> > can make imbalance slightly lower than the weighted load of the cfs_rq.
> > But this is enough to skip the rq in find_busiest_queue and prevent asym
> > migration from happening.
> >
> > Add 1 to the avg_load to make sure that the targeted cpu will not be
> > skipped unexpectedly.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 6 ++++++
> > 1 file changed, 6 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index ca46964..c215f7a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -8215,6 +8215,12 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >  	/* Adjust by relative CPU capacity of the group */
> >  	sgs->group_capacity = group->sgc->capacity;
> >  	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
> > +	/*
> > +	 * Add 1 so that division rounding cannot make the computed imbalance
> > +	 * slightly lower than the original load, which would prevent the rq
> > +	 * from being selected as the busiest queue.
> > +	 */
> > +	sgs->avg_load += 1;
>
> I've tried investigating this off-by-one issue by running lmbench, but it
> seems to be a gnarly one.
>
>
>
> The adventure starts in update_sg_lb_stats() where we compute
> sgs->avg_load:
>
> sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity
>
> Then we drop by find_busiest_group() and call check_asym_packing() which,
> if we do need to pack, does:
>
>     env->imbalance = DIV_ROUND_CLOSEST(
>             sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
>             SCHED_CAPACITY_SCALE);
>
> And finally the one check that seems to be failing, and that you're
> addressing with a +1 is in find_busiest_queue():
>
>     if (rq->nr_running == 1 && wl > env->imbalance &&
>         !check_cpu_capacity(rq, env->sd))
>             continue;
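>
> Putting those three steps together in a standalone toy (illustrative
> numbers, not the ones from the trace below; DIV_ROUND_CLOSEST() here is
> a simplified, unsigned-only stand-in for the kernel macro) shows how the
> floor of the first division, composed with the round-to-closest of the
> second one, can leave env->imbalance one short of wl:
>
>   /* Userspace sketch of the two roundings, not kernel code. */
>   #include <stdio.h>
>
>   #define SCHED_CAPACITY_SCALE    1024UL
>   #define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))
>
>   int main(void)
>   {
>           unsigned long group_load = 1024;     /* one CPU fully busy */
>           unsigned long group_capacity = 1843; /* 4 CPUs, slightly pressured */
>           unsigned long wl = 1023;             /* weighted load of the busy rq */
>
>           /* update_sg_lb_stats(): plain division, rounds down (568 here) */
>           unsigned long avg_load = (group_load * SCHED_CAPACITY_SCALE) /
>                                    group_capacity;
>
>           /* check_asym_packing(): rounds to closest (1022 here) */
>           unsigned long imbalance = DIV_ROUND_CLOSEST(avg_load * group_capacity,
>                                                       SCHED_CAPACITY_SCALE);
>
>           /* 1023 > 1022, so find_busiest_queue() would skip the rq */
>           printf("avg_load=%lu imbalance=%lu wl=%lu -> %s\n",
>                  avg_load, imbalance, wl,
>                  wl > imbalance ? "rq skipped" : "rq selected");
>           return 0;
>   }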
>
>
>
> Now, running lmbench on my HiKey960 hacked up to use asym packing, I
> get an example where there's a task that should be migrated via asym
> packing but isn't:
>
> # This is a DIE level balance
> update_sg_lb_stats():
> cpu=0 load=1
> cpu=1 load=1023
> cpu=2 load=0
> cpu=3 load=0
>
> avg_load=568
>
> # Busiest group is [0-3]
> find_busiest_group(): check_asym_packing():
> env->imbalance = 1022
>
> find_busiest_queue():
> cpu=0 wl=1
> cpu=1 wl=1023
> cpu=2 wl=0
> cpu=3 wl=0
>
> Only CPU1 has rq->nr_running > 0 (==1 actually), but wl > env->imbalance
> so we skip it in find_busiest_queue().
>
> Now, I thought it was due to IRQ/rt pressure - the group capacity of 1840
> seen here isn't the maximum for the LITTLE cluster, which should be
> (463 * 4) == 1852. I got curious and threw random group capacity values
> into that equation while keeping the same avg_load [1], and it gets wild:
> you get a signal that oscillates between 1024 and 1014!

hmm ... this doesn't seem to be related to $subject

>
> [1]: https://gist.github.com/valschneider/f42252e83be943d276b901f57734b8ba
>
> Adding +1 to avg_load doesn't solve those pesky rounding errors that get
> worse as group_capacity grows, but it reverses the trend: the signal
> oscillates above 1024.
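>
> A small standalone variation on what [1] does (made-up values;
> DIV_ROUND_CLOSEST() is again a simplified, unsigned-only stand-in for
> the kernel macro) that runs a range of group_capacity values through
> both divisions, with and without the +1:
>
>   #include <stdio.h>
>
>   #define SCHED_CAPACITY_SCALE    1024UL
>   #define DIV_ROUND_CLOSEST(x, d) (((x) + (d) / 2) / (d))
>
>   int main(void)
>   {
>           unsigned long group_load = 1024;  /* illustrative */
>           unsigned long cap;
>
>           for (cap = 1830; cap <= 1852; cap++) {
>                   unsigned long avg = (group_load * SCHED_CAPACITY_SCALE) / cap;
>                   unsigned long imb = DIV_ROUND_CLOSEST(avg * cap,
>                                                         SCHED_CAPACITY_SCALE);
>                   unsigned long imb_p1 = DIV_ROUND_CLOSEST((avg + 1) * cap,
>                                                            SCHED_CAPACITY_SCALE);
>
>                   printf("cap=%lu imbalance=%lu imbalance+1=%lu\n",
>                          cap, imb, imb_p1);
>           }
>           return 0;
>   }
>
> With these made-up numbers the plain imbalance bounces between 1022 and
> 1024 while the +1 variant stays at or above 1024; the exact range depends
> on the real loads and capacities.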
>
>
>
> So in the end a +1 *could* "fix" this, but
> a) It'd make more sense to have it in the check_asym_packing() code (a
>    possible shape is sketched below)
> b) It's ugly and it makes me feel like this piece of code is flimsy AF.
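>
> One possible shape of a), rounding up when rebuilding the imbalance
> instead of bumping avg_load (just a sketch of where the adjustment could
> live, not a tested change):
>
>     env->imbalance = DIV_ROUND_UP(
>             sds->busiest_stat.avg_load * sds->busiest_stat.group_capacity,
>             SCHED_CAPACITY_SCALE);
>
> Since avg_load was already rounded down, even this can in theory end up
> one below the group load when group_capacity is above
> SCHED_CAPACITY_SCALE, so it doesn't make the arithmetic exact either.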

This is another use case: for now asym packing is used at SMT level, where
we don't face this kind of problem. It has also been tested on a DynamiQ
configuration, which is similar to SMT: 1 CPU per sched_group. The legacy
b.L topology was not the main target, although it can be one too.

The rounding error is there and should be fixed; your use case is another
one, specific to legacy b.L, that can be treated in a separate patch.

>
> >
> >  	if (sgs->sum_nr_running)
> >  		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
> >