Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Tejun Heo
Date: Tue May 02 2017 - 17:51:02 EST


On Tue, May 02, 2017 at 09:18:53AM +0200, Vincent Guittot wrote:
> > dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
> > dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
> > dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
> > dbg_odd_dump: A: CPU001: run=1 schb=1
> > dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
> > dbg_odd_dump: A: Q001-asdf: schbench(153757C):w=1.000,l=0.527,u=0.514
> > dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
> > dbg_odd_dump: A: Q001-/: asdf(C):w=5.744,l=3.017,u=0.521
> > dbg_odd_dump: A: CPU017: run=2 schb=2
> > dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
> > dbg_odd_dump: A: Q017-asdf: schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
> > dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
> > dbg_odd_dump: A: Q017-/: asdf(C):w=10.653,l=5.269,u=0.966
> > dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
> > dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
> > dbg_odd_dump: B: CPU014: run=1 schb=1
> > dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
> > dbg_odd_dump: B: Q014-asdf: schbench(153760C):w=1.000,l=0.491,u=0.476
> > dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
> > dbg_odd_dump: B: Q014-/: asdf(C):w=5.605,l=5.766,u=0.970
> > dbg_odd_dump: B: CPU030: run=1 schb=1
> > dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
> > dbg_odd_dump: B: Q030-asdf: schbench(153747C):w=1.000,l=0.537,u=0.516
> > dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
> > dbg_odd_dump: B: Q030-/: asdf(C):w=5.758,l=3.092,u=0.516
> >
> > You can notice that B's pertask weight is 4.409 which is way higher
> > than A's 2.779, and this is from Q014-asdf's contribution to Q014-/ is
> > twice as high as it should be. The root queue's runnable avg should
> Are you sure that this is because of blocked load in group A? It can
> be that Q014-asdf has already had to wait before running and its load
> still increases while runnable but not running.

This is with propagation enabled, so the only thing contributing to
the root queue's runnable_load_avg is the load being propagated from
Q014-asdf, whose load avg is twice its runnable load avg. The past
history doesn't matter for load balancing, and without cgroups this
blocked load wouldn't have contributed to the root's runnable load
avg. I don't think it can get much clearer.
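
For reference, the arithmetic behind the dump can be checked with a
short sketch. The values are copied from the dbg_odd_dump lines quoted
above; the helper names are made up for illustration, and the assumption
that pertask is simply sum divided by run is inferred from the logged
numbers, not taken from the debug patch itself.

```python
# Values copied from the quoted dbg_odd_dump output.
# Assumption: "pertask" is the group's "sum" divided by its "run" count.
GROUPS = {
    "A": {"sum": 8.337, "run": 3, "pertask_logged": 2.779},
    "B": {"sum": 8.819, "run": 2, "pertask_logged": 4.409},
}

# Per-cgroup queues (Qnnn-asdf): l = load avg, r = runnable load avg.
ASDF_QUEUES = {
    "Q014-asdf": {"l": 1.004, "r": 0.492},
    "Q030-asdf": {"l": 0.538, "r": 0.558},
}

# Root queues (Qnnn-/): r = runnable load avg seen by the load balancer.
ROOT_RUNNABLE = {"Q014-/": 5.774, "Q030-/": 3.044}


def pertask(sum_load, nr_running):
    """Hypothetical helper: average load per running task in a group."""
    return sum_load / nr_running


def load_to_runnable(queue):
    """Hypothetical helper: ratio of load avg to runnable load avg."""
    return queue["l"] / queue["r"]


if __name__ == "__main__":
    for name, g in GROUPS.items():
        print(name, "pertask:", round(pertask(g["sum"], g["run"]), 4),
              "logged:", g["pertask_logged"])
    for name, q in ASDF_QUEUES.items():
        print(name, "load/runnable:", round(load_to_runnable(q), 3))
    print("Q014-/ vs Q030-/ runnable:",
          round(ROOT_RUNNABLE["Q014-/"] / ROOT_RUNNABLE["Q030-/"], 3))
```

Both CPU014 and CPU030 run a single schbench thread of similar load,
yet Q014-/ ends up with close to twice the runnable load of Q030-/,
which lines up with Q014-asdf's load avg being about twice its runnable
load avg.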

> IIUC your trace, group A has 2 running tasks and group B only one but
> load_balance selects B because of its sgs->avg_load being higher. But
> this can also happen even if runnable_load_avg of child cfs_rq was
> propagated correctly in group entity because we can have a situation
> where group A has only 1 task with a higher load than the 2 tasks on
> group B, and even if blocked load is not taken into account,
> load_balance will select A.

Yes, it can happen with tasks w/ different weights. That's clearly
not what's happening here. The load balancer is picking the wrong CPU
far more frequently because the root queue's runnable load avg
incorrectly includes blocked load avgs from nested cfs_rqs.
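
The gap between the two averages can be illustrated with a toy model.
This is only a sketch under stated assumptions, not the kernel's PELT
code: two equal-weight tasks alternate so that exactly one is runnable
in each period, every task's average decays geometrically with
y^32 = 0.5, the queue's load avg is taken as the sum over all tasks
(blocked ones included), and its runnable load avg as the sum over the
currently queued tasks only. All names are hypothetical.

```python
# Toy model only; this is not the kernel implementation.
# Geometric decay factor chosen so a contribution halves every 32 periods.
Y = 0.5 ** (1 / 32)


def decay_step(avg, runnable):
    """Advance one period: decay the old average, add this period's share."""
    return avg * Y + (1.0 - Y) * (1.0 if runnable else 0.0)


def simulate(periods=1000):
    """Two weight-1.0 tasks that alternate being runnable every period.

    Returns (load_avg, runnable_load_avg) for the queue after `periods`:
      load_avg          = sum over both tasks, blocked one included
      runnable_load_avg = average of the task queued in the last period
    """
    a = b = 0.0
    a_runnable = True
    for i in range(periods):
        a_runnable = (i % 2 == 0)
        a = decay_step(a, a_runnable)
        b = decay_step(b, not a_runnable)
    load_avg = a + b
    runnable_load_avg = a if a_runnable else b
    return load_avg, runnable_load_avg


if __name__ == "__main__":
    load, runnable = simulate()
    print("load_avg:", round(load, 3))
    print("runnable_load_avg:", round(runnable, 3))
    print("ratio:", round(load / runnable, 3))
```

In this model the queue's load avg settles near 1.0 while its runnable
load avg stays near 0.5, the same roughly 2:1 relation as Q014-asdf's
l=1.004 and r=0.492. If the larger figure is what gets propagated
upward, the parent's runnable load is overstated by the blocked part.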

> IMHO, we had better improve load balance selection. I'm going to
> add smarter group selection in load_balance. That's something we
> should have already done but it was difficult without load/util_avg
> propagation. It should be doable now.

That's all well and good, but let's fix the bug first; otherwise, we'd
be papering over an existing issue with a new mechanism, which is a bad
idea for any code base that has to last.

> > We can argue whether overriding a cfs_rq se's load_avg to the scaled
> > runnable_load_avg of the cfs_rq is the right way to go or we should
> > introduce a separate channel to propagate runnable_load_avg; however,
> > it's clear that we need to fix runnable_load_avg propagation one way
> > or another.
> The minimum would be to not break load_avg

Oh yeah, this I can understand. The proposed change is icky in that
it forces group se->avg.load_avg to be the runnable_load_avg of the
corresponding group cfs_rq. We *can* introduce a separate channel,
say, se->group_runnable_load_avg, which is used to propagate
runnable_load_avg; however, the thing is that we don't really use
group se->avg.load_avg anywhere, so we might as well just override it.

I have a preliminary patch to introduce a separate field, but it looks
sad too because we end up calculating both load_avg and
runnable_load_avg to propagate them separately without actually using
the former value anywhere.

> > The thing with cfs_rq se's load_avg is that it isn't really used
> > anywhere else AFAICS, so overriding it to the cfs_rq's
> > runnable_load_avg isn't the prettiest but doesn't really change anything.
> load_avg is used for defining the share of each cfs_rq.

Each cfs_rq calculates its load_avg independently from the weight sum.
The queued se's load_avgs don't affect the cfs_rq's load_avg in any
direct way. The only time the value is used is for propagation during
migration; however, group se's themselves never get migrated, and
during propagation only deltas matter, so the difference between
load_avg and runnable_load_avg isn't gonna matter that much. In
short, we don't really use group se's load_avg in any significant way.