Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

From: Tejun Heo
Date: Tue May 02 2017 - 16:56:47 EST


Hello, Vincent.

On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
> On 28 April 2017 at 18:14, Tejun Heo <tj@xxxxxxxxxx> wrote:
> > I'll follow up in the other subthread but there really is a
> > fundamental difference in how we calculate runnable_avg w/ and w/o
> > cgroups. Independent of whether we can improve the load balancer
> > further, it is an existing bug.
>
> I'd like to weigh that a bit.
> The runnable_load_avg works correctly as it is because it correctly
> reflects the load of runnable entities at the root domain.
> If you start to propagate the runnable_load_avg on the load_avg of the
> group entity, the load will become unstable.
> runnable_load_avg has been added to fix load_balance being unable to
> select the right busiest rq. So the goal is to use load_avg more and
> more, not the opposite.

I have a hard time understanding what you're trying to say here.

Without cgroup, the load balancer uses the sum of load_avgs of the
running tasks on the queue. As shown by the debug trace, the load
balancer repeatedly ends up picking the wrong CPU when cgroup is
involved because it ends up including the load_avgs of nested blocked
tasks into runnable_load_avg of root - we lose the distinction between
running and blocked load_avgs when we pass through a nested cfs_rq.
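
To make the difference concrete, here is a minimal toy model of the
two behaviors described above. It is not kernel code: ToyTask,
flat_runnable_load() and nested_runnable_load() are hypothetical names,
the load values are made up, and group shares scaling is ignored
entirely.

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    load_avg: int      # made-up load contribution of the task
    runnable: bool     # True = on the runqueue, False = blocked


def flat_runnable_load(tasks):
    """Without cgroup: only runnable tasks count toward the root."""
    return sum(t.load_avg for t in tasks if t.runnable)


def nested_runnable_load(tasks):
    """With one nested cfs_rq (assumed simplification): the group
    entity is runnable if any task is, and it carries the group's
    whole load_avg, blocked tasks included."""
    if not any(t.runnable for t in tasks):
        return 0
    return sum(t.load_avg for t in tasks)


# cpu0: one runnable task plus two blocked ones; cpu1: two runnable tasks.
cpu0 = [ToyTask(1024, True), ToyTask(1024, False), ToyTask(1024, False)]
cpu1 = [ToyTask(1024, True), ToyTask(1024, True)]

# Flat view: cpu0 = 1024, cpu1 = 2048, so cpu1 is the busiest.
# Nested view: cpu0 = 3072, cpu1 = 2048, so cpu0 looks busiest instead.
```

In this sketch the nested view picks cpu0 even though it has only one
runnable task, which is the kind of wrong pick the debug trace shows.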

We can further improve the load balancer all we want. For example,
right now, we would end up picking a CPU with one task which has a
really high weight over another CPU with two normal-weight tasks,
which isn't ideal. However, there is something obviously broken in the
existing mechanism and we want to fix that first, independent of
further improvements. It won't be a good idea to paper over an
existing problem with a different mechanism either.
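
As a rough illustration of the high-weight example above, the sketch
below compares picking the busiest CPU by summed load against picking
it by number of runnable tasks. The weights (4096 and 1024) and the
helper names are assumptions made up for the illustration, not values
or functions taken from the scheduler.

```python
def busiest_by_load(cpus):
    """Pick the CPU whose runnable tasks have the largest summed load."""
    return max(cpus, key=lambda name: sum(cpus[name]))


def busiest_by_nr_running(cpus):
    """Pick the CPU with the most runnable tasks."""
    return max(cpus, key=lambda name: len(cpus[name]))


# Hypothetical per-task loads: one heavy task vs. two normal ones.
cpus = {
    "cpu0": [4096],         # single high-weight task, nothing to migrate
    "cpu1": [1024, 1024],   # two normal-weight tasks, one could move
}

by_load = busiest_by_load(cpus)          # selects "cpu0"
by_count = busiest_by_nr_running(cpus)   # selects "cpu1"
```

Load alone selects cpu0, although pulling its only task does not
improve the balance. That is a separate issue from the blocked load
leaking into runnable_load_avg.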

> >> I always leave time between 2 consecutive runs, and the 10
> >> consecutive runs take around 2min to execute
> >>
> >> Then I have run these 10 consecutive tests several times and the
> >> results stay the same
> >
> > Can you please try the patch at the end of this message? I'm
> > wondering whether you're seeing the errors accumulating from the wrong
> > min().
>
> I still have the regression with the patch below.
> The regression comes from the use of runnable_load_avg to propagate.
> As load_avg becomes null at some point, it breaks the computation of
> share and the load_avg stays very low

That's surprising given that what the patch does is bring the cgroup
behavior closer to the !cgroup behavior. It'd be great to be able to
reproduce the problem and trace it. It looks like the board is pretty
standardized. Would the following be equivalent to the one you have?

http://a.co/f3dD1lm

If so, I can just buy it, get your test image, repro it here and
trace why the regression is happening with the setup. We might be
hitting something else.

Thanks.

--
tejun