Re: [PATCH 0/2 v4] sched: Rewrite per entity runnable load average tracking

From: Morten Rasmussen
Date: Thu Jul 31 2014 - 04:54:43 EST

On Wed, Jul 30, 2014 at 08:17:39PM +0100, Yuyang Du wrote:
> Hi Morten,
> On Wed, Jul 30, 2014 at 11:13:31AM +0100, Morten Rasmussen wrote:
> > > > 2. runnable_load_avg and blocked_load_avg are combined
> > > >
> > > > runnable_load_avg currently represents the sum of load_avg_contrib of
> > > > all tasks on the rq, while blocked_load_avg is the sum of those tasks
> > > > not on a runqueue. It makes perfect sense to consider the sum of both
> > > > when calculating the load of a cpu, but we currently don't include
> > > > blocked_load_avg. The reason for that is the priority scaling of the
> > > > task load_avg_contrib may lead to under-utilization of cpus that
> > > > occasionally have a tiny high priority task running. You can easily have a
> > > > task that takes 5% of cpu time but has a load_avg_contrib several times
> > > > larger than a default priority task runnable 100% of the time.
> > >
> > > So this is the effect of historical averaging and weight scaling, both of which
> > > are just generally good, but may have bad cases.
> >
> > I don't agree that weight scaling is generally good. There have been
> > several threads discussing that topic over the last half year or so. It
> > is there to ensure smp niceness, but it makes load-balancing on systems
> > which are not fully utilized sub-optimal. You may end up with some cpus
> > not being fully utilized while others are over-utilized when you have
> > multiple tasks running at different priorities.
> >
> > It is a very real problem when user-space uses priorities extensively
> > like Android does. Tasks related to audio run at very high priorities
> > but only for a very short amount of time, but due to the priority
> > scaling their load ends up being several times higher than tasks running
> > all the time at normal priority. Hence task load is a very poor
> > indicator of utilization.
> I understand the problem you said, but the problem is not described crystal clear.
> You are saying tasks with big weight contribute too much, even they are running
> short time. But is it unfair or does it lead to imbalance? It is hard to say if
> not no. They have big weight, so are supposed to be "unfair" vs. small weight
> tasks for the sake of fairness. In addition, since they are running short time,
> their runnable weight/load is offset by this factor.

It does lead to imbalance and the problem is indeed very real as I
already said. It has been discussed numerous times before:


Default priority (nice=0) has a weight of 1024. nice=-20 has a weight of
88761. So a nice=-20 task that runs ~10% of the time has a load contribution
of ~8876, which is >8x the weight of a nice=0 task that runs 100% of the
time. Load contribution is used for load-balancing, which means that you
will put at least eight 100% nice=0 tasks on a cpu before you start
putting any additional tasks on the cpu with the nice=-20 task. So you
over-subscribe one cpu by 700% while another is idle 90% of the time.
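
The arithmetic above can be checked with a short sketch. The two weights
are the ones quoted in this mail; the helper name load_contrib and the
simplification that a steady-state contribution is just weight times
runnable fraction (ignoring the decaying average) are assumptions made
for illustration only.

```python
# Weights quoted above: nice=0 -> 1024, nice=-20 -> 88761.
NICE_0_WEIGHT = 1024
NICE_MINUS_20_WEIGHT = 88761


def load_contrib(weight, runnable_fraction):
    """Simplified steady-state contribution: weight scaled by runnable time.

    Assumption: the real tracking uses a decaying average; ignored here.
    """
    return weight * runnable_fraction


high_prio = load_contrib(NICE_MINUS_20_WEIGHT, 0.10)  # runs ~10% of the time
default = load_contrib(NICE_0_WEIGHT, 1.00)           # runs 100% of the time

# Number of always-running nice=0 tasks that fit on another cpu while that
# cpu still looks lighter than the one holding the single nice=-20 task.
tasks_to_match = int(high_prio // default)

print(f"nice=-20 @ 10%: {high_prio:.0f}")
print(f"nice=0 @ 100%:  {default:.0f}")
print(f"ratio:          {high_prio / default:.2f}")
print(f"nice=0 tasks placed before balance: {tasks_to_match}")
```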

You may argue that this is 'fair', but it is very much a waste of
resources. Putting nice=0 tasks on the same cpu as the nice=-20 task
will have nearly no effect on the cpu time allocated to the nice=-20 task
due to the vruntime scaling. Hence there is virtually no downside in
terms of giving priority and a lot to be gained in terms of throughput.
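
A rough sketch of why co-locating costs the high priority task almost
nothing. It assumes a plain proportional-share model, where a contended
cpu is divided among runnable tasks in proportion to their weight; the
function name entitled_share is hypothetical and the model leaves out
everything else the scheduler does.

```python
NICE_0_WEIGHT = 1024
NICE_MINUS_20_WEIGHT = 88761


def entitled_share(weight, all_weights):
    """Fraction of a contended cpu a task may claim under proportional sharing."""
    return weight / sum(all_weights)


# One nice=-20 task sharing a cpu with eight always-running nice=0 tasks.
weights = [NICE_MINUS_20_WEIGHT] + [NICE_0_WEIGHT] * 8
share = entitled_share(NICE_MINUS_20_WEIGHT, weights)

demand = 0.10  # the nice=-20 task only wants ~10% of the cpu
print(f"entitled share: {share:.1%}, demand: {demand:.0%}")
print("demand met:", share >= demand)
```

Under these assumptions the entitled share stays far above what the task
asks for, so the nice=0 tasks only consume time that would otherwise be
idle.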

Generally, we don't have to care about priority as long as no cpu is
fully utilized. All tasks get the cpu time they need.

The problem with considering blocked priority scaled load is that the
load doesn't disappear when the task blocks, so it effectively
reserves too much cpu time for high priority tasks.
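
To give a feel for how long blocked load lingers, here is a small sketch.
It assumes the geometric decay used by the per-entity load tracking, with
a per-period factor y such that y^32 = 0.5 and a period of roughly 1 ms;
the names DECAY_Y and decayed_load are made up for this example, and the
starting value is the ~8876 contribution from the nice=-20 example.

```python
# Assumption: per-period decay factor y chosen so that y**32 == 0.5,
# with one period being roughly 1 ms.
DECAY_Y = 0.5 ** (1 / 32)


def decayed_load(load, periods):
    """Load remaining after `periods` periods of being blocked."""
    return load * DECAY_Y ** periods


blocked = 8876.0      # contribution of the nice=-20 task that runs ~10%
nice_0_full = 1024.0  # a nice=0 task that is runnable 100% of the time

for ms in (0, 32, 64, 96):
    print(f"after {ms:3d} ms blocked: {decayed_load(blocked, ms):7.0f}")

# Periods until the blocked contribution drops below one busy nice=0 task.
periods = 0
while decayed_load(blocked, periods) >= nice_0_full:
    periods += 1
print("periods until below a busy nice=0 task:", periods)
```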

A real world use-case where this happens is described here:

> I think I am saying from pure fairness point of view, which is just generally good
> in the sense that we can't think of a more "generally good" thing to replace it.

Unweighted utilization. As said above, we only need to care about
priority when cpus are fully utilized. It doesn't break any fairness.
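
The difference between the two metrics can be illustrated with a toy
placement loop. This is not the kernel's load-balancer: the greedy
"pick the least loaded cpu" policy, the Task class and the place()
helper are all assumptions made for this sketch. Only the two weights
come from the numbers earlier in this mail.

```python
from dataclasses import dataclass


@dataclass
class Task:
    weight: int   # priority weight (1024 for nice=0, 88761 for nice=-20)
    util: float   # fraction of cpu time the task actually uses

    @property
    def weighted_load(self):
        return self.weight * self.util


def place(tasks, n_cpus, metric):
    """Greedy placement: put each task on the cpu with the lowest metric sum.

    Returns the resulting per-cpu utilization (sum of util).
    """
    metric_sum = [0.0] * n_cpus
    util_sum = [0.0] * n_cpus
    for task in tasks:
        cpu = metric_sum.index(min(metric_sum))
        metric_sum[cpu] += metric(task)
        util_sum[cpu] += task.util
    return util_sum


# One nice=-20 task running ~10%, then eight nice=0 tasks running 100%.
tasks = [Task(88761, 0.10)] + [Task(1024, 1.0) for _ in range(8)]

by_weighted = place(tasks, 2, lambda t: t.weighted_load)
by_util = place(tasks, 2, lambda t: t.util)

print("balanced on weighted load:", [round(u, 1) for u in by_weighted])
print("balanced on utilization:  ", [round(u, 1) for u in by_util])
```

With two cpus, balancing on weighted load piles all eight nice=0 tasks
onto one cpu, while balancing on utilization spreads them evenly.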

> And you are saying when big weight task is not runnable, but already contributes
> "too much" load, then leads to under utilization. So this is the matter of our
> predicting algorithm. I am afraid I will say again the prediction is generally
> good. For the audio example, which is strictly periodic, it just can't be better.

I disagree. The priority scaled prediction is generally bad. Why reserve
up to ~87x more cpu time to a task than is actually needed, when
the unweighted load tracking (utilization) is readily available?

> FWIW, I am really not sure how serious this under utilization problem is in real
> world.

Again, it is indeed a real world problem. We have experienced it first
hand and have been experimenting with this over the last 2-3 years. I'm
not making this up.

We have included unweighted load (utilization) in our RFC patch set for
the same reason. And the out-of-tree big.LITTLE solution carries similar
patches too.

> I am not saying your argument does not make sense. It makes every sense from specific
> case point of view. I do think there absolutely can be sub-optimal cases. But as
> I said, I just don't think the problem description is clear enough so that we know
> it is worth solving (by pros and cons comparison) and how to solve it, either
> generally or specifically.

I didn't repeat the whole history in my first response as I thought this
had already been debated several times and we had reached agreement that
it is indeed a problem. You are not the first one to propose including
priority scaled blocked load in the load estimation.

> Plus, as Peter said, we have to live with user space uses big weight, and do it as
> what weight is supposed to be.

I don't follow. Are you saying it is fine to intentionally make
load-balancing worse for any user-space that uses task priorities other
than default?

You can't just ignore users of task priority. You may have the point of
view that you don't care about under-utilization, but there are lots of
users who do. Optimizing for energy consumption is a primary goal for
the mobile space (and servers seem to be moving that way too). This
requires more accurate estimates of cpu utilization to manage how many
cpus are needed. Ignoring priority scaling is moving in the exact
opposite direction and conflicts with other ongoing efforts.

Overall, it is not clear to me why it is necessary to rewrite the
per-entity load-tracking. The code is somewhat simpler, but I don't see
any functional additions/improvements. If we have to go through a long
review and testing process, why not address some of the most obvious
issues with the existing implementation while we are at it? I don't see
the point in replacing something sub-optimal with equally sub-optimal
(or worse).
