Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Vincent Guittot
Date: Wed May 03 2017 - 03:55:33 EST

On 3 May 2017 at 09:25, Vincent Guittot <vincent.guittot@xxxxxxxxxx> wrote:
> On 2 May 2017 at 22:56, Tejun Heo <tj@xxxxxxxxxx> wrote:
>> Hello, Vincent.
>> On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
>>> On 28 April 2017 at 18:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
>>> > I'll follow up in the other subthread but there really is fundamental
>>> > difference in how we calculate runnable_avg w/ and w/o cgroups.
>>> > Indepndent of whether we can improve the load balancer further, it is
>>> > an existing bug.
>>> I'd like to weight that a bit.
>>> The runnable_load_avg works correctly as it is because it reflects
>>> correctly the load of runnable entity at root domain
>>> If you start to propagate the runnable_load_avg on the load_avg of the
>>> group entity, the load will become unstable.
>>> runnable_load_avg has been added to fix load_balance being unable to
>>> select the right busiest rq. So the goal is to use more and more
>>> load_avg not the opposite
>> I have a hard time understanding what you're trying to say here.
>> Without cgroup, the load balancer uses the sum of load_avgs of the
>> running tasks on the queue. As shown by the debug trace, the load
>> balancer repeatedly ends up picking the wrong CPU when cgroup is
>> involved because it ends up including the load_avgs of nested blocked
>> tasks into runnable_load_avg of root - we lose the distinction between
>> running and blocked load_avgs when we pass through a nested cfs_rq.
>> We can further improve the load balancer all we want, for example,
>> right now, we would end up picking a CPU with one task which has a
>> really high weight over another CPU with two normal weight tasks even,
>> which isn't ideal; however, there is something obviously broken in the
>> existing mechanism and we want to fix that first independent of
>> further improvements, and it won't be a good idea to paper over an
>> existing problem with a different mechanism either.
> That's probably where I disagree.
> IMO, runnable_load_avg is already a fix to compensate the fact when
> PELT has been rewritten, the load balance has not been update
> accordingly and was unable to make the right choice
> between 2 groups: one group having only 1 task but a higher load and
> another group with several tasks but a lower load (either because of
> the blocked load or because a runnable task with far higher load than
> others). runnable_load_avg has been put back to quickly fix the 1st
> case but it doesn't fix the other. There were also some issues of
> propagating load update with task migration but this has been fixed
> now.
> So we have 2 use case with the exact same behavior a group with a
> higher load (runnable_load_avg or load_avg) but only 1 task and
> another one with several tasks but a lower load. At now
> runnable_load_avg can only fix one with the impact of breaking
> load_avg behavior whereas one fix in the load_balance can fix both
> withour breaking load_avg behavior.
> We should better work on removing runnable_load_avg completely than
> adding more stuff on it
>>> >> I always let time between 2 consecutive run and the 10 consecutive
>>> >> runs take around 2min to execute
>>> >>
>>> >> Then I have run several time these 10 consecutive test and results stay the same
>>> >
>>> > Can you please try the patch at the end of this message? I'm
>>> > wondering whether you're seeing the errors accumulating from the wrong
>>> > min().
>>> I still have the regression with the patch below.
>>> The regression comes from the use runnable_load_avg to propagate. As
>>> load_avg becomes null at some point, it break the computation of share
>>> and the load_avg stay very low
>> That's surprising given that what the patch does is bringing the
>> cgroup behavior closer to !cgroup behavior. It'd be great to be able
> But it's also break load_avg which is used to calculate per cfs_rq
> share and this change generates the regression on my board
>> to reproduce the problem and trace it. It looks like the board is
>> pretty standardized. Would the following be equivalent to the one you
>> have?
> Yes it's this board i have the previous version but it should not
> change the behavior
>> If so, I can just buy it, get your test image and repro it here and
>> trace why the regression is happening with the setup. We might be
>> hitting something else.
> I have pushed my branch which includes v4.11-rc8 + your patches + some
> debug trace here:
> and uploaded a trace_cmd's trace that shows the regression

I have also uploaded the trace withour your patches so you can compare both:

> As an example, we can easily see arounf time stamp 94.826469 sec that
> cpu5 is idle while CPU2 and CPU3 share their time between several
> task.
> This doesn't happens without your patch
>> Thanks.
>> --
>> tejun