Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity

From: Morten Rasmussen
Date: Thu May 19 2016 - 11:36:14 EST


On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@xxxxxxxxx> wrote:
> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
> >> Commit-ID: cfa10334318d8212d007da8c771187643c9cef35
> >> Gitweb: http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
> >> Author: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
> >> Committer: Ingo Molnar <mingo@xxxxxxxxxx>
> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
> >>
> >> sched/fair: Correct unit of load_above_capacity
> >>
> >> In calculate_imbalance() load_above_capacity currently has the unit
> >> [capacity] while it is used as being [load/capacity]. Not only is it
> >> wrong it also makes it unlikely that load_above_capacity is ever used
> >> as the subsequent code picks the smaller of load_above_capacity and
> >> the avg_load
> >>
> >> This patch ensures that load_above_capacity has the right unit
> >> [load/capacity].
>
> load_above_capacity has the unit [load] and is then compared with other load

I'm not sure if talk about the same code, I'm referring to:

max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

a bit further down. Here the first option has unit [load/capacity], and
the subsequent code multiplies the result with [capacity] to get back to
[load]:

/* How much load to actually move to equalise the imbalance */
env->imbalance = min(
max_pull * busiest->group_capacity,
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;

That lead me to the conclusion that load_above_capacity would have to be
'per capacity', i.e. [load/capacity], as well for the code to make
sense.

>
> >>
> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> >> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> >> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> >> Cc: Mike Galbraith <efault@xxxxxx>
> >> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@xxxxxxx
> >> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
> >> ---
> >> kernel/sched/fair.c | 6 ++++--
> >> 1 file changed, 4 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2338105..218f8e8 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >> if (busiest->group_type == group_overloaded &&
> >> local->group_type == group_overloaded) {
> >> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
> >> - if (load_above_capacity > busiest->group_capacity)
> >> + if (load_above_capacity > busiest->group_capacity) {
> >> load_above_capacity -= busiest->group_capacity;
> >> - else
> >> + load_above_capacity *= NICE_0_LOAD;
> >> + load_above_capacity /= busiest->group_capacity;
>
> If I follow correctly the sequence,
> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
> so
> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>
> so the unit is [load]

I'm with you for the equation, but not for the unit and I find it hard
convince myself what the unit really is :-(

I think it is down to the fact that we are comparing [load] directly
with [capacity] here, basically saying [load] == [capacity]. Most other
places we scale [load] by [capacity], we don't compare the two or
otherwise imply that they are the same unit. Do we have any other
examples of this?

The name load_above_capacity suggests that it should have unit [load],
but we actually compute it by subtracting to values, where the latter
clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
sched/fair: Clean up scale confusion) changes the former to be
[capacity].

load_above_capacity is later multiplied by [capacity] to determine the
imbalance which must have unit [load]. So working our way backwards,
load_above_capacity must have unit [load/capacity]. However, if [load] =
[capacity] then what we really have is just group-normalized [load].

As said in my other reply, the code only really makes sense for under
special circumstances where [load] == [capacity]. The easy, and possibly
best, way out of this mess would be to replace the code with something
like PeterZ's suggestion that I tried to implement in the patch included
in my other reply.

> Lets take the example of a group made of 2 cores with 2 threads per
> core and the capacity of each core is 1,5* SCHED_CAPACITY_SCALE so the
> capacity of the group is 3*SCHED_CAPACITY_SCALE.
> If we have 6 tasks on this group :
> load_above capacity = 1 *NICE_0_LOAD which means we want to remove no
> more than 1 tasks from the group and let 5 tasks in the group whereas
> we don't expect to have more than 4 tasks as we have 4 CPUs and
> probably less because the compute capacity of each CPU is less than
> the default one
>
> So i would have expected
> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
> load_above capacity = 3*NICE_0_LOAD which means we want to remove no
> more than 3 tasks and let 3 tasks in the group

And this is exactly you get with this patch :-) load_above_capacity
(through max_pull) is multiplied by the group capacity to compute that
actual amount of [load] to remove:

env->imbalance = load_above_capacity * busiest->group_capacity /
SCHED_CAPACITY_SCALE

= 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
SCHED_CAPACITY_SCALE

= 3*NICE_0_LOAD

I don't think we disagree on how it should work :-) Without the capacity
scaling in this patch you get:

env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE

= 9*SCHED_CAPACITY_SCALE

Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
ensure that the resulting imbalance has the proper unit [load].

I'm happy to scrap this patch and replace the whole thing with something
else that makes more sense, but IMHO it is needed if the current mess
should make any sense at all.