Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

From: Vincent Guittot
Date: Thu Jul 09 2020 - 09:51:26 EST


On Thu, 9 Jul 2020 at 15:06, Valentin Schneider
<valentin.schneider@xxxxxxx> wrote:
>
>
> On 02/07/20 15:42, Vincent Guittot wrote:
> > task_h_load() can return 0 in some situations like running stress-ng
> > mmapfork, which forks thousands of threads, in a sched group on a 224 cores
> > system. The load balance doesn't handle this correctly because
> > env->imbalance never decreases and it will stop pulling tasks only after
> > reaching loop_max, which can be equal to the number of running tasks of
> > the cfs. Make sure that imbalance will be decreased by at least 1.
> >
> > misfit task is the other feature that doesn't handle such a situation
> > correctly, although the problem is probably harder to hit there
> > because of the smaller number of CPUs and running tasks on
> > heterogeneous systems.
> >
> > We can't simply ensure that task_h_load() returns at least one because
> > that would require handling underrun in other places.
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
>
> I dug some more into this; if I got my math right, this can be reproduced
> with a single task group below the root. Forked tasks get max load, so this
> can be tried out with either tons of forks or tons of CPU hogs.
>
> We need
>
> p->se.avg.load_avg * cfs_rq->h_load
> ----------------------------------- < 1
> cfs_rq_load_avg(cfs_rq) + 1
>
> Assuming a homogeneous system with tasks spread out all over (no other
> tasks interfering), that should boil down to
>
> 1024 * (tg.shares / nr_cpus)
> --------------------------- < 1
> 1024 * (nr_tasks_on_cpu)
>
> IOW
>
> tg.shares / nr_cpus < nr_tasks_on_cpu
>
> If we get tasks nicely spread out, a simple condition to hit this should be
> to have more tasks than shares.
>
> I can hit task_h_load=0 with the following on my Juno (pinned to one CPU to
> make things simpler; big.LITTLE doesn't yield equal weights between CPUs):
>
> cgcreate -g cpu:tg0
>
> echo 128 > /sys/fs/cgroup/cpu/tg0/cpu.shares
>
> for ((i=0; i<130; i++)); do
> # busy loop of your choice
> taskset -c 0 ./loop.sh &
> echo $! > /sys/fs/cgroup/cpu/tg0/tasks
> done
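
To make the arithmetic above concrete, here is a minimal user-space sketch
of that ratio with truncating integer division. It is not the kernel code:
task_h_load_model() and hog_h_load() are hypothetical names, and it assumes
that every hog sits at the maximum load_avg of 1024 and that the group's
h_load on the pinned CPU equals its cpu.shares value.

```c
#include <stdint.h>

/*
 * User-space model of the ratio quoted above:
 *     load_avg * h_load / (cfs_rq_load_avg + 1)
 * The division truncates, so any ratio below 1 becomes 0.
 */
static uint64_t task_h_load_model(uint64_t task_load_avg, uint64_t h_load,
				  uint64_t cfs_rq_load_avg)
{
	return (task_load_avg * h_load) / (cfs_rq_load_avg + 1);
}

/*
 * Assumption for illustration only: nr_tasks identical CPU hogs on one
 * CPU, each at load_avg 1024, with the group's h_load equal to shares.
 */
static uint64_t hog_h_load(uint64_t shares, uint64_t nr_tasks)
{
	const uint64_t max_load = 1024;

	return task_h_load_model(max_load, shares, nr_tasks * max_load);
}
```

Under these assumptions, hog_h_load(128, 130) evaluates to 0, while
hog_h_load(128, 127) still evaluates to 1.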
>
> > ---
> > kernel/sched/fair.c | 18 +++++++++++++++++-
> > 1 file changed, 17 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6fab1d17c575..62747c24aa9e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4049,7 +4049,13 @@ static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
> > return;
> > }
> >
> > - rq->misfit_task_load = task_h_load(p);
> > + /*
> > + * Make sure that misfit_task_load will not be null even if
> > + * task_h_load() returns 0. misfit_task_load is only used to select
> > + * rq with highest load so adding 1 will not modify the result
> > + * of the comparison.
> > + */
> > + rq->misfit_task_load = task_h_load(p) + 1;
>
> For here and below; wouldn't it be a tad cleaner to just do
>
> foo = max(task_h_load(p), 1);

+1

For the one below, my goal was mainly not to affect the result of the
tests that run before the +1 is applied, but applying it before them
will not change their results either.

I'm going to update it
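
As a rough illustration of why the clamp matters for detach_tasks(), the
following user-space sketch models only the imbalance accounting. The names
clamp_load() and count_detached() are made up for this example, and the
real loop has more checks than shown here.

```c
#include <stddef.h>

/* Hypothetical helper: never let a task's load count as 0. */
static unsigned long clamp_load(unsigned long load)
{
	return load ? load : 1;
}

/*
 * Simplified model of the imbalance accounting: walk the tasks, subtract
 * each load from the imbalance, and stop once the imbalance is consumed
 * or loop_max tasks have been visited. Returns the number of detached
 * tasks. This is a sketch, not the kernel's detach_tasks().
 */
static size_t count_detached(const unsigned long *loads, size_t nr_tasks,
			     long imbalance, size_t loop_max, int clamp)
{
	size_t detached = 0;

	for (size_t i = 0; i < nr_tasks && i < loop_max; i++) {
		unsigned long load = loads[i];

		if (clamp)
			load = clamp_load(load);

		imbalance -= (long)load;
		detached++;

		if (imbalance <= 0)
			break;
	}

	return detached;
}
```

With 1000 tasks whose modelled load is 0 and an imbalance of 10, the
unclamped loop runs all the way to loop_max, while the clamped one stops
after 10 tasks.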

>
> Otherwise, I think I've properly convinced myself we do want to have
> that in one form or another. So either way:
>
> Reviewed-by: Valentin Schneider <valentin.schneider@xxxxxxx>

Thanks

>
> > }
> >
> > #else /* CONFIG_SMP */
> > @@ -7664,6 +7670,16 @@ static int detach_tasks(struct lb_env *env)
> > env->sd->nr_balance_failed <= env->sd->cache_nice_tries)
> > goto next;
> >
> > + /*
> > + * Depending on the number of CPUs and tasks and the
> > + * cgroup hierarchy, task_h_load() can return a null
> > + * value. Make sure that env->imbalance decreases
> > + * otherwise detach_tasks() will stop only after
> > + * detaching up to loop_max tasks.
> > + */
> > + if (!load)
> > + load = 1;
> > +
> > env->imbalance -= load;
> > break;