Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
From: Vincent Guittot
Date: Thu Apr 27 2017 - 04:28:40 EST
On 27 April 2017 at 02:30, Tejun Heo <tj@xxxxxxxxxx> wrote:
> Hello, Vincent.
>
> On Wed, Apr 26, 2017 at 12:21:52PM +0200, Vincent Guittot wrote:
>> > This is from the follow-up patch. I was confused. Because we don't
>> > propagate decays, we still should decay the runnable_load_avg;
>> > otherwise, we end up accumulating errors in the counter. I'll drop
>> > the last patch.
>>
>> Ok, the runnable_load_avg goes back to 0 when I drop patch 3. But I
>> see runnable_load_avg sometimes significantly higher than load_avg,
>> which is normally not possible as load_avg = runnable_load_avg +
>> sleeping tasks' load_avg
>
> So, while load_avg would eventually converge on runnable_load_avg +
> blocked load_avg given a stable enough workload for long enough,
> runnable_load_avg jumping above load_avg temporarily is expected,
No, it's not. Look at load_avg/runnable_load_avg at the root domain
when only tasks are involved: runnable_load_avg will never be higher
than load_avg, because

  load_avg = \sum load_avg of tasks attached to the cfs_rq
  runnable_load_avg = \sum load_avg of tasks attached and enqueued
                      to the cfs_rq

So load_avg = runnable_load_avg + blocked tasks' load_avg, and as a
result runnable_load_avg is never higher than load_avg.
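As a standalone illustration of that invariant (a toy userspace
snippet with made-up per-task load_avg values, not kernel code): the
enqueued tasks are a subset of the attached tasks, so summing the same
per-task contributions over both sets can never make runnable_load_avg
exceed load_avg.

#include <stdio.h>

struct task { unsigned long load_avg; int enqueued; };

int main(void)
{
	/* hypothetical tasks attached to one cfs_rq; two are sleeping */
	struct task tasks[] = {
		{ 1024, 1 },
		{  512, 0 },	/* blocked */
		{  256, 1 },
		{  128, 0 },	/* blocked */
	};
	unsigned long load_avg = 0, runnable_load_avg = 0;
	int i;

	for (i = 0; i < 4; i++) {
		load_avg += tasks[i].load_avg;	/* every attached task */
		if (tasks[i].enqueued)		/* only enqueued tasks */
			runnable_load_avg += tasks[i].load_avg;
	}

	/* prints load_avg=1920 runnable_load_avg=1280 */
	printf("load_avg=%lu runnable_load_avg=%lu\n",
	       load_avg, runnable_load_avg);
	return 0;
}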
And with the propagate load/util_avg patchset, we can even reflect
task migration directly at the root domain, whereas before we had to
wait for util/load_avg and runnable_load_avg to converge to the new
value.
Just to confirm one of my assumptions: the latency regression was
already there in previous kernel versions and is not a result of the
propagate load/util_avg patchset, is it?
> AFAICS. That's the whole point of it, a sum closely tracking what's
> currently on the cpu so that we can pick the cpu which has the most on
> it now. It doesn't make sense to try to pick threads off of a cpu
> which is generally loaded but doesn't have much going on right now,
> after all.
The only point of runnable_load_avg is to be null when a cfs_rq is
idle while load_avg is not; it is not meant to be higher than
load_avg. The root cause is that load_balance only looks at "load" but
not at the number of tasks currently running, and that's probably the
main problem: runnable_load_avg has been added because load_balance
fails to filter out idle groups and idle rqs. We would be better off
adding a new type in group_classify to tag groups that are idle, and
doing the same in find_busiest_queue; see the sketch below.
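To make the idea concrete, here is a rough sketch (a simplified
userspace mock-up loosely modeled on group_classify() in
kernel/sched/fair.c; the group_idle value and the sum_nr_running check
are only my assumption of what such a tag could look like, not actual
kernel code):

#include <stdbool.h>

/* Simplified stand-in for the sched_group statistics; the real
 * sg_lb_stats in kernel/sched/fair.c has many more fields. */
struct sg_lb_stats {
	unsigned int sum_nr_running;	/* tasks running in the group */
	bool group_no_capacity;
	bool group_imb;
};

enum group_type {
	group_idle,		/* hypothetical: nothing running now */
	group_other,
	group_imbalanced,
	group_overloaded,
};

/* Sketch: classify a group as idle when it has no running task, so
 * that load_balance()/find_busiest_queue() could filter it out
 * directly instead of relying on runnable_load_avg being null. */
static enum group_type group_classify(const struct sg_lb_stats *sgs)
{
	if (sgs->group_no_capacity)
		return group_overloaded;

	if (sgs->group_imb)
		return group_imbalanced;

	if (!sgs->sum_nr_running)
		return group_idle;	/* never picked as busiest */

	return group_other;
}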
>
>> Then, I just have the opposite behavior on my platform: I see an
>> increase of latency at p99 with your patches.
>> My platform is a hikey: 2x4-core ARM, and I have used schbench -m 2
>> -t 4 -s 10000 -c 15000 -r 30 so I have 1 worker thread per CPU, which
>> is similar to what you are doing on your platform
>>
>> With v4.11-rc8, I have run the test 10 times and get consistent results
> ...
>> *99.0000th: 539
> ...
>> With your patches I see an increase of the latency for p99. I run 10
>> *99.0000th: 2034
>
> I see. This is surprising given that at least the purpose of the
> patch is restoring cgroup behavior to match !cgroup one. I could have
> totally messed it up tho. Hmm... there are several ways forward I
> guess.
>
> * Can you please double check that the higher latencies w/ the patch
> are reliably reproducible? The test machines that I use have
> variable management load. They never dominate the machine but are
> enough to disturb the results, so that drawing out a reliable
> pattern takes a lot of repeated runs. I'd really appreciate it if you
> could double check that the pattern is reliable with different run
> patterns (i.e. instead of 10 consecutive runs one after another,
> interleaved).
I always leave time between 2 consecutive runs, and the 10 consecutive
runs take around 2 minutes to execute.
I have also run this set of 10 consecutive tests several times and the
results stay the same.
>
> * Is the board something easily obtainable? It'd be the easiest for
> me to set up the same environment and reproduce the problem. I
> looked up hikey boards on amazon but couldn't easily find 2x4 core
It is often named hikey octo core, but I said 2x4 cores just to point
out that there are 2 clusters, which is important for scheduler
topology and behavior.
> ones. If there's something I can easily buy, please point me to it.
> If there's something I can loan, that'd be great too.
It looks like most websites are currently out of stock.
>
> * If not, I'll try to clean up the debug patches I have and send them
> your way to get more visibility, but given these things tend to be
> very iterative, it might take quite a bit of back and forth.
Yes, that could be useful. Even a trace of the regression could be
useful.
I can also push to my git tree the debug patch that I use for tracking
load metrics, if you want. It's ugly but it does the job.
Thanks
>
> Thanks!
>
> --
> tejun