Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
From: Tejun Heo
Date: Fri Apr 28 2017 - 16:34:01 EST
Hello, Vincent.
On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 00:52, Tejun Heo <tj@xxxxxxxxxx> wrote:
> > Hello,
> >
> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> On 24 April 2017 at 22:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
> >> Can the problem be on the load balance side instead ? and more
> >> precisely in the wakeup path ?
> >> After looking at the trace, it seems that task placement happens at
> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> you will have to wait for a load balance which is already too late
> >
> > Oh, I was tracing most of scheduler activities and the ratios of
> > wakeups picking idle CPUs were about the same regardless of cgroup
> > membership. I can confidently say that the latency issue that I'm
> > seeing is from load balancer picking the wrong busiest CPU, which is
> > not to say that there can't be other problems.
>
> ok. Is there any trace that you can share ? your behavior seems
> different from mine
I'm attaching the debug patch. With your change (avg instead of
runnable_avg), the following trace shows why it's wrong.
It's dumping a case where group A has a CPU w/ more than two schbench
threads and B doesn't, but the load balancer is determining that B is
more heavily loaded.
dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
dbg_odd_dump: A: CPU001: run=1 schb=1
dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
dbg_odd_dump: A: Q001-asdf: schbench(153757C):w=1.000,l=0.527,u=0.514
dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
dbg_odd_dump: A: Q001-/: asdf(C):w=5.744,l=3.017,u=0.521
dbg_odd_dump: A: CPU017: run=2 schb=2
dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
dbg_odd_dump: A: Q017-asdf: schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
dbg_odd_dump: A: Q017-/: asdf(C):w=10.653,l=5.269,u=0.966
dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
dbg_odd_dump: B: CPU014: run=1 schb=1
dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
dbg_odd_dump: B: Q014-asdf: schbench(153760C):w=1.000,l=0.491,u=0.476
dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
dbg_odd_dump: B: Q014-/: asdf(C):w=5.605,l=5.766,u=0.970
dbg_odd_dump: B: CPU030: run=1 schb=1
dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
dbg_odd_dump: B: Q030-asdf: schbench(153747C):w=1.000,l=0.537,u=0.516
dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
dbg_odd_dump: B: Q030-/: asdf(C):w=5.758,l=3.092,u=0.516
Note that B's pertask weight is 4.409, which is way higher than A's
2.779. That's because Q014-asdf's contribution to Q014-/ is twice as
high as it should be. The root queue's runnable avg should only
contain what's currently active, but because we're scaling load avg,
which includes both active and blocked, we end up picking group B
over A.
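
As a quick sanity check on the figures above, the group-level numbers can be reproduced from the per-queue lines. This is only a sketch: it assumes that a group's sum is the total of the r= values on its Q0xx-/ lines and that pertask is sum divided by run. Both readings match the printed values but are inferred from the dump, not taken from the debug patch, and the names used here are made up for illustration.

```python
# Root-queue runnable averages (the r= field on the Q0xx-/ lines) and the
# group-level run= count, copied from the dump.
groups = {
    "A": {"root_r": [3.067, 5.270], "run": 3},
    "B": {"root_r": [5.774, 3.044], "run": 2},
}


def summarize(group):
    """Return (sum, pertask).

    Assumption: sum is the total of the root r values and
    pertask is sum / run.
    """
    total = sum(group["root_r"])
    return total, total / group["run"]


summary = {name: summarize(g) for name, g in groups.items()}

# Q014-asdf: l (read here as active + blocked) against r (active only).
q014_inflation = 1.004 / 0.492

for name, (total, pertask) in sorted(summary.items()):
    print(f"group {name}: sum ~ {total:.3f}, pertask ~ {pertask:.3f}")
print(f"Q014-asdf l/r ~ {q014_inflation:.2f}")
```

Under those assumptions the l/r ratio on Q014-asdf comes out at roughly two, which lines up with the "twice as high" contribution described above.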
This shows up in the total number of times we pick the wrong queue and
thus latency. I'm running the following script with the debug patch
applied.
#!/bin/bash
date
cat /proc/self/cgroup
echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
echo 0 > /sys/module/fair/parameters/dbg_odd_cnt
~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
cat /sys/module/fair/parameters/dbg_odd_cnt
With your patch applied, in the root cgroup,
Fri Apr 28 12:48:59 PDT 2017
0::/
Latency percentiles (usec)
50.0000th: 26
75.0000th: 63
90.0000th: 78
95.0000th: 88
*99.0000th: 707
99.5000th: 5096
99.9000th: 10352
min=0, max=13743
577
In the /asdf cgroup,
Fri Apr 28 13:19:53 PDT 2017
0::/asdf
Latency percentiles (usec)
50.0000th: 35
75.0000th: 67
90.0000th: 81
95.0000th: 98
*99.0000th: 2212
99.5000th: 4536
99.9000th: 11024
min=0, max=13026
1708
The last line is the number of times the load balancer picked a group
w/o more than two schbench threads on a CPU over one w/. Some number
of these are expected as there are other threads and there is some
play in all the calculations, but propagating avg or not propagating at
all significantly increases the count and latency.
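
For comparing runs like the two above, a small parser helps. The following is a minimal sketch that assumes the output layout shown in this mail (percentile lines, a min/max line, then the dbg_odd_cnt value alone on the last line); parse_run and the constant names are hypothetical helpers, not part of schbench or the debug patch.

```python
import re

PCT_RE = re.compile(r"^\*?(\d+(?:\.\d+)?)th:\s*(\d+)$")


def parse_run(text):
    """Return ({percentile: usec}, odd_count) for one run's output."""
    percentiles = {}
    odd_count = None
    for raw in text.strip().splitlines():
        line = raw.strip()
        match = PCT_RE.match(line)
        if match:
            percentiles[float(match.group(1))] = int(match.group(2))
        elif line.isdigit():
            # A bare integer line is taken to be the dbg_odd_cnt value.
            odd_count = int(line)
    return percentiles, odd_count


ROOT_RUN = """
Latency percentiles (usec)
50.0000th: 26
75.0000th: 63
90.0000th: 78
95.0000th: 88
*99.0000th: 707
99.5000th: 5096
99.9000th: 10352
min=0, max=13743
577
"""

ASDF_RUN = """
Latency percentiles (usec)
50.0000th: 35
75.0000th: 67
90.0000th: 81
95.0000th: 98
*99.0000th: 2212
99.5000th: 4536
99.9000th: 11024
min=0, max=13026
1708
"""

root_pct, root_odd = parse_run(ROOT_RUN)
asdf_pct, asdf_odd = parse_run(ASDF_RUN)

p99_ratio = asdf_pct[99.0] / root_pct[99.0]
odd_ratio = asdf_odd / root_odd
print(f"p99 ratio /asdf vs root: {p99_ratio:.2f}")
print(f"odd-pick ratio /asdf vs root: {odd_ratio:.2f}")
```

On these two runs both the 99th percentile latency and the odd-pick count are roughly three times higher in /asdf than in the root cgroup.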
> > The issue isn't about whether runnable_load_avg or load_avg should be
> > used but the unexpected differences in the metrics that the load
>
> I think that's the root of the problem. I explain a bit more my view
> on the other thread
So, when picking the busiest group, the only thing which matters is
the queue's runnable_load_avg, which should approximate the sum of all
on-queue loads on that CPU.
If we don't propagate, or if we propagate load_avg, we're factoring
the blocked avg of descendant cgroups into the root's
runnable_load_avg, which is obviously wrong.
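
The effect of mixing blocked load into the comparison can be shown with a toy model. This is a deliberately simplified sketch, not the kernel's load tracking: ToyCpu, its fields, and the numbers are all made up, and load_avg is simply modelled as runnable plus blocked.

```python
from dataclasses import dataclass


@dataclass
class ToyCpu:
    name: str
    runnable: float  # load of threads currently on the queue
    blocked: float   # leftover load of threads that have gone to sleep

    @property
    def load_avg(self):
        # Toy assumption: load avg is just active plus blocked load.
        return self.runnable + self.blocked


def busiest(cpus, key):
    """Return the name of the CPU that maximizes the given metric."""
    return max(cpus, key=key).name


cpus = [
    # Two threads on the queue right now, nothing blocked.
    ToyCpu("two-runnable", runnable=1.0, blocked=0.0),
    # One thread on the queue, plus load left behind by sleepers.
    ToyCpu("one-runnable", runnable=0.5, blocked=1.0),
]

by_runnable = busiest(cpus, key=lambda c: c.runnable)
by_load = busiest(cpus, key=lambda c: c.load_avg)

print("busiest by runnable:", by_runnable)
print("busiest by load_avg:", by_load)
```

In this toy setup the runnable metric points at the CPU that actually has two threads queued, while the metric that includes blocked load points at the other one.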
We can argue whether overriding a cfs_rq se's load_avg to the scaled
runnable_load_avg of the cfs_rq is the right way to go or we should
introduce a separate channel to propagate runnable_load_avg; however,
it's clear that we need to fix runnable_load_avg propagation one way
or another.
The thing with cfs_rq se's load_avg is that it isn't really used
anywhere else AFAICS, so overriding it to the cfs_rq's
runnable_load_avg isn't the prettiest but doesn't really change anything.
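
To see what scaling from runnable_load_avg instead of load_avg would mean for the dump earlier in this mail, here is a rough sketch. It rests on an observation, not on the patch itself: the group se loads in the dump are close to se weight * child l / child weight. The row layout and function name are made up, and the result ignores averaging lag and any other root-level load.

```python
# (group, cpu, se weight, se load, child queue weight, child l, child r),
# copied from the Q0xx-/ and Q0xx-asdf lines of the dump.
rows = [
    ("A", "CPU001", 5.744, 3.017, 1.000, 0.525, 0.527),
    ("A", "CPU017", 10.653, 5.269, 2.000, 0.989, 0.988),
    ("B", "CPU014", 5.605, 5.766, 1.000, 1.004, 0.492),
    ("B", "CPU030", 5.758, 3.092, 1.000, 0.538, 0.558),
]


def scaled(se_weight, child_avg, child_weight):
    """Assumed scaling: child average weighted by se weight / child weight."""
    return se_weight * child_avg / child_weight


from_load = {"A": 0.0, "B": 0.0}
from_runnable = {"A": 0.0, "B": 0.0}
for group, cpu, se_w, se_l, q_w, q_l, q_r in rows:
    est_l = scaled(se_w, q_l, q_w)
    est_r = scaled(se_w, q_r, q_w)
    from_load[group] += est_l
    from_runnable[group] += est_r
    print(f"{cpu}: dumped se l={se_l:.3f} from l ~ {est_l:.3f} from r ~ {est_r:.3f}")

print("group totals scaled from l:", from_load)
print("group totals scaled from r:", from_runnable)
```

With these numbers, scaling from l leaves B looking heavier than A, while scaling from r puts A ahead, which is the ordering you'd expect given that A has the CPU with two schbench threads.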
Thanks.
--
tejun