Re: [PATCH] sched/fair: handle case of task_h_load() returning 0

From: Vincent Guittot
Date: Thu Jul 09 2020 - 09:52:45 EST


On Thu, 9 Jul 2020 at 15:34, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
>
> On 08/07/2020 11:47, Vincent Guittot wrote:
> > On Wed, 8 Jul 2020 at 11:45, Dietmar Eggemann <dietmar.eggemann@xxxxxxx> wrote:
> >>
> >> On 02/07/2020 16:42, Vincent Guittot wrote:
> >>> task_h_load() can return 0 in some situations like running stress-ng
> >>> mmapfork, which forks thousands of threads, in a sched group on a 224-core
> >>> system. The load balance doesn't handle this correctly because
> >>
> >> I guess the issue here is that 'cfs_rq->h_load' in
> >>
> >> task_h_load() {
> >>         struct cfs_rq *cfs_rq = task_cfs_rq(p);
> >>         ...
> >>         return div64_ul(p->se.avg.load_avg * cfs_rq->h_load,
> >>                         cfs_rq_load_avg(cfs_rq) + 1);
> >> }
> >>
> >> is still ~0 (or at least pretty small) compared to se.avg.load_avg being
> >> 1024 and cfs_rq_load_avg(cfs_rq) n*1024 in these lb occurrences.
> >>
> >>> env->imbalance never decreases and it will stop pulling tasks only after
> >>> reaching loop_max, which can be equal to the number of running tasks of
> >>> the cfs. Make sure that imbalance will be decreased by at least 1.
>
> Looks like it's bounded by sched_nr_migrate (32 on my E5-2690 v2).

yes
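
To illustrate the point quoted above, here is a minimal user-space sketch, not the kernel code. pull_iterations() and its parameters are hypothetical names; the loop only models "subtract the task's load from the imbalance until it reaches 0 or loop_max is hit". The min_load argument models clamping the load to a lower bound, which is one way to read "decreased by at least 1".

```c
/*
 * Hypothetical, simplified model of the pull loop (not kernel code).
 * Each iteration considers one task and subtracts its load from the
 * remaining imbalance. min_load is an assumed lower bound applied to
 * the per-task load (0 means no clamping).
 * Returns the number of iterations performed.
 */
static unsigned int pull_iterations(long imbalance, unsigned long task_load,
				    unsigned int loop_max,
				    unsigned long min_load)
{
	unsigned int loop = 0;

	while (loop < loop_max) {
		/* Clamp the load so that every iteration makes progress. */
		unsigned long load = task_load < min_load ? min_load : task_load;

		loop++;
		imbalance -= (long)load;

		/* Stop as soon as the imbalance has been absorbed. */
		if (imbalance <= 0)
			break;
	}

	return loop;
}
```

With an example imbalance of 8, a per-task load of 0 and loop_max = 32, pull_iterations(8, 0, 32, 0) runs all 32 iterations, whereas pull_iterations(8, 0, 32, 1) stops after 8.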

>
> env.loop_max = min(sysctl_sched_nr_migrate, busiest->nr_running);
>
> [...]
>
> >> I assume that this is related to the LKP mail
> >
> > I found this problem while studying the regression raised in the
> > email below, but this patch doesn't fix it. At least, it's not enough.
> >
> >>
> >> https://lkml.kernel.org/r/20200421004749.GC26573@shao2-debian ?
>
> I see. It also happens with other workloads but it's most visible
> at the beginning of a workload (fork).
>
> Still on E5-2690 v2 (2*2*10, 40 CPUs):
>
> In the taskgroup cfs_rq->h_load is ~ 1024/40 = 25 so this leads to
> task_h_load = 0 with cfs_rq->avg.load_avg 40 times higher than the
> individual task load (1024).
>
> One incarnation of 20 loops w/o any progress (that's w/o your patch).
>
> With loop='loop/loop_break/loop_max'
> and load='p->se.avg.load_avg/cfs_rq->h_load/cfs_rq->avg.load_avg'
>
> Jul 9 10:41:18 e105613-lin kernel: [73.068844] [stress-ng-mmapf 2907] SMT CPU37->CPU17 imb=8 loop=1/32/32 load=1023/23/43006
> Jul 9 10:41:18 e105613-lin kernel: [73.068873] [stress-ng-mmapf 3501] SMT CPU37->CPU17 imb=8 loop=2/32/32 load=1022/23/41983
> Jul 9 10:41:18 e105613-lin kernel: [73.068890] [stress-ng-mmapf 2602] SMT CPU37->CPU17 imb=8 loop=3/32/32 load=1023/23/40960
> ...
> Jul 9 10:41:18 e105613-lin kernel: [73.069136] [stress-ng-mmapf 2520] SMT CPU37->CPU17 imb=8 loop=18/32/32 load=1023/23/25613
> Jul 9 10:41:18 e105613-lin kernel: [73.069144] [stress-ng-mmapf 3107] SMT CPU37->CPU17 imb=8 loop=19/32/32 load=1021/23/24589
> Jul 9 10:41:18 e105613-lin kernel: [73.069149] [stress-ng-mmapf 2672] SMT CPU37->CPU17 imb=8 loop=20/32/32 load=1024/23/23566
> ...
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Tested-by: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>

Thanks
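
The load triples in the trace above can be checked against the task_h_load() expression quoted earlier with a small user-space sketch. The function names below are hypothetical stand-ins: div64_ul_sketch() assumes div64_ul() behaves as plain unsigned 64-bit integer division, and task_h_load_sketch() only mirrors the quoted return expression, not the rest of the kernel function.

```c
#include <stdint.h>

/* Assumed stand-in for div64_ul(): truncating 64-bit integer division. */
static uint64_t div64_ul_sketch(uint64_t dividend, uint64_t divisor)
{
	return dividend / divisor;
}

/*
 * Mirrors the quoted expression:
 *   p->se.avg.load_avg * cfs_rq->h_load / (cfs_rq_load_avg(cfs_rq) + 1)
 * Arguments follow the order of the 'load=' field in the trace:
 *   se_load_avg / h_load / cfs_rq_load_avg
 */
static uint64_t task_h_load_sketch(uint64_t se_load_avg, uint64_t h_load,
				   uint64_t cfs_rq_load_avg)
{
	return div64_ul_sketch(se_load_avg * h_load, cfs_rq_load_avg + 1);
}
```

For the first logged entry (load=1023/23/43006) the numerator is 1023 * 23 = 23529, which is smaller than 43007, so the result truncates to 0. The same holds for loop 20 (load=1024/23/23566): 23552 is still smaller than 23567.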
