Re: [tip:sched/core] sched/fair: Correct unit of load_above_capacity
From: Vincent Guittot
Date: Fri May 20 2016 - 04:18:12 EST
Hi Morten,
On 19 May 2016 at 17:36, Morten Rasmussen <morten.rasmussen@xxxxxxx> wrote:
> On Fri, May 13, 2016 at 10:22:19AM +0200, Vincent Guittot wrote:
>> On 12 May 2016 at 23:48, Yuyang Du <yuyang.du@xxxxxxxxx> wrote:
>> > On Thu, May 12, 2016 at 03:31:51AM -0700, tip-bot for Morten Rasmussen wrote:
>> >> Commit-ID: cfa10334318d8212d007da8c771187643c9cef35
>> >> Gitweb: http://git.kernel.org/tip/cfa10334318d8212d007da8c771187643c9cef35
>> >> Author: Morten Rasmussen <morten.rasmussen@xxxxxxx>
>> >> AuthorDate: Fri, 29 Apr 2016 20:32:40 +0100
>> >> Committer: Ingo Molnar <mingo@xxxxxxxxxx>
>> >> CommitDate: Thu, 12 May 2016 09:55:33 +0200
>> >>
>> >> sched/fair: Correct unit of load_above_capacity
>> >>
>> >> In calculate_imbalance() load_above_capacity currently has the unit
>> >> [capacity] while it is used as being [load/capacity]. Not only is it
>> >> wrong it also makes it unlikely that load_above_capacity is ever used
>> >> as the subsequent code picks the smaller of load_above_capacity and
>> >> the avg_load
>> >>
>> >> This patch ensures that load_above_capacity has the right unit
>> >> [load/capacity].
>>
>> load_above_capacity has the unit [load] and is then compared with other load
>
> I'm not sure if talk about the same code, I'm referring to:
>
> max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
>
> a bit further down. Here the first option has unit [load/capacity], and
> the subsequent code multiplies the result with [capacity] to get back to
> [load]:
My understand is that it multiplies and divides by capacity
so we select the minimum between
max_pull * busiest->group_capacity/SCHED_CAPACITY_SCALE
and
(sds->avg_load - local->avg_load) * local->group_capacity /
SCHED_CAPACITY_SCALE
so the unit of imbalance is the same as max_pull because we multiply
and divide by [capacity]
the imbalance's unit is [load] so max_pull also has a unit [load]
and max_pull has the same unit of load_above_capacity
>
> /* How much load to actually move to equalise the imbalance */
> env->imbalance = min(
> max_pull * busiest->group_capacity,
> (sds->avg_load - local->avg_load) * local->group_capacity
> ) / SCHED_CAPACITY_SCALE;
>
> That lead me to the conclusion that load_above_capacity would have to be
> 'per capacity', i.e. [load/capacity], as well for the code to make
> sense.
>
>>
>> >>
>> >> Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
>> >> [ Changed changelog to note it was in capacity unit; +rebase. ]
>> >> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
>> >> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
>> >> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
>> >> Cc: Mike Galbraith <efault@xxxxxx>
>> >> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
>> >> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
>> >> Cc: linux-kernel@xxxxxxxxxxxxxxx
>> >> Link: http://lkml.kernel.org/r/1461958364-675-4-git-send-email-dietmar.eggemann@xxxxxxx
>> >> Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
>> >> ---
>> >> kernel/sched/fair.c | 6 ++++--
>> >> 1 file changed, 4 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 2338105..218f8e8 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -7067,9 +7067,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >> if (busiest->group_type == group_overloaded &&
>> >> local->group_type == group_overloaded) {
>> >> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE;
>> >> - if (load_above_capacity > busiest->group_capacity)
>> >> + if (load_above_capacity > busiest->group_capacity) {
>> >> load_above_capacity -= busiest->group_capacity;
>> >> - else
>> >> + load_above_capacity *= NICE_0_LOAD;
>> >> + load_above_capacity /= busiest->group_capacity;
>>
>> If I follow correctly the sequence,
>> load_above_capacity = (busiest->sum_nr_running * SCHED_CAPACITY_SCALE
>> - busiest->group_capacity) * NICE_0_LOAD / busiest->group_capacity
>> so
>> load_above_capacity = busiest->sum_nr_running * SCHED_CAPACITY_SCALE /
>> busiest->group_capacity * NICE_0_LOAD - NICE_0_LOAD
>>
>> so the unit is [load]
>
> I'm with you for the equation, but not for the unit and I find it hard
> convince myself what the unit really is :-(
the sole NICE_0_LOAD in the equation above tends to confirms that it's
only [load]
>
> I think it is down to the fact that we are comparing [load] directly
> with [capacity] here, basically saying [load] == [capacity]. Most other
> places we scale [load] by [capacity], we don't compare the two or
> otherwise imply that they are the same unit. Do we have any other
> examples of this?
IIUC the goal of this part of calculate_imbalance, we make the
assumption that one task with NICE_0_LOAD should fit in one core with
SCHED_CAPACITY_SCALE capacity and we want to ensure that we will not
make a CPU idle
>
> The name load_above_capacity suggests that it should have unit [load],
> but we actually compute it by subtracting to values, where the latter
> clearly has unit [capacity]. PeterZ's recent patch (1be0eb2a97
> sched/fair: Clean up scale confusion) changes the former to be
> [capacity].
I have made a comment on this.
The original equation was
load_above_capacity -= busiest->group_capacity
which come from the optimization of
load_above_capacity -= busiest->group_capacity * SCHED_LOAD_SCALE /
SCHED_CAPACITY_SCALE
The assumption of the optimization was the SCHED_LOAD_SCALE ==
SCHED_CAPACITY_SCALE and SCHED_LOAD_SCALE == NICE_0_LOAD but it's not
always true if we enable the extra precision for 64bits system
>
> load_above_capacity is later multiplied by [capacity] to determine the
> imbalance which must have unit [load]. So working our way backwards,
it's multiplied by xxx->group_capacity and divided by / SCHED_CAPACITY_SCALE;
> load_above_capacity must have unit [load/capacity]. However, if [load] =
> [capacity] then what we really have is just group-normalized [load].
>
> As said in my other reply, the code only really makes sense for under
> special circumstances where [load] == [capacity]. The easy, and possibly
> best, way out of this mess would be to replace the code with something
> like PeterZ's suggestion that I tried to implement in the patch included
> in my other reply.
I'm fine with replacing this part by something more simple
>
>> Lets take the example of a group made of 2 cores with 2 threads per
>> core and the capacity of each core is 1,5* SCHED_CAPACITY_SCALE so the
>> capacity of the group is 3*SCHED_CAPACITY_SCALE.
>> If we have 6 tasks on this group :
>> load_above capacity = 1 *NICE_0_LOAD which means we want to remove no
>> more than 1 tasks from the group and let 5 tasks in the group whereas
>> we don't expect to have more than 4 tasks as we have 4 CPUs and
>> probably less because the compute capacity of each CPU is less than
>> the default one
>>
>> So i would have expected
>> load_above_capacity = busiest->sum_nr_running * NICE_0_LOAD -
>> busiest->group_capacity * NICE_0_LOAD / SCHED_CAPACITY_SCALE
>> load_above capacity = 3*NICE_0_LOAD which means we want to remove no
>> more than 3 tasks and let 3 tasks in the group
>
> And this is exactly you get with this patch :-) load_above_capacity
> (through max_pull) is multiplied by the group capacity to compute that
> actual amount of [load] to remove:
I forgot that additional weight of the load by group's
capacity/SCHED_CAPACITY_SCALE
>
> env->imbalance = load_above_capacity * busiest->group_capacity /
> SCHED_CAPACITY_SCALE
>
> = 1*NICE_0_LOAD * 3*SCHED_CAPACITY_SCALE /
> SCHED_CAPACITY_SCALE
>
> = 3*NICE_0_LOAD
>
> I don't think we disagree on how it should work :-) Without the capacity
> scaling in this patch you get:
>
> env->imbalance = (6*SCHED_CAPACITY_SCALE - 3*SCHED_CAPACITY_SCALE) *
> 3*SCHED_CAPACITY_SCALE / SCHED_CAPACITY_SCALE
>
> = 9*SCHED_CAPACITY_SCALE
>
> Coming back to Yuyang's question. I think it should be NICE_0_LOAD to
> ensure that the resulting imbalance has the proper unit [load].
yes, i agree it's should NICE_0_LOAD
>
> I'm happy to scrap this patch and replace the whole thing with something
> else that makes more sense, but IMHO it is needed if the current mess
> should make any sense at all.