Re: [PATCH 3/7] sched/fair: Correct unit of load_above_capacity
From: Peter Zijlstra
Date: Tue May 03 2016 - 06:52:37 EST
On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> From: Morten Rasmussen <morten.rasmussen@xxxxxxx>
>
> In calculate_imbalance() load_above_capacity currently has the unit
> [load] while it is used as being [load/capacity]. Not only is it wrong it
> also makes it unlikely that load_above_capacity is ever used as the
> subsequent code picks the smaller of load_above_capacity and the avg_load
> difference between the local and the busiest sched_groups.
>
> This patch ensures that load_above_capacity has the right unit
> [load/capacity].
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> ---
> kernel/sched/fair.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bc19fee94f39..c066574cff04 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> local->group_type == group_overloaded) {
> load_above_capacity = busiest->sum_nr_running *
> SCHED_LOAD_SCALE;
Given smpnice and cgroups; this starting point seems somewhat random.
Then again, without cgroup and assuming all tasks are nice-0 it is still
true. But I think the cgroup muck is rather universal these days.
The above SCHED_LOAD_SCALE is the 1 -> load metric, right?
> - if (load_above_capacity > busiest->group_capacity)
> + if (load_above_capacity > busiest->group_capacity) {
> load_above_capacity -= busiest->group_capacity;
> - else
> + load_above_capacity *= SCHED_LOAD_SCALE;
> + load_above_capacity /= busiest->group_capacity;
And this one does load->capacity
> + } else
> load_above_capacity = ~0UL;
> }
Is there anything better we can do? I know there's a fair bit of cruft
in; but these assumptions that SCHED_LOAD_SCALE is one task, just bug
the hell out of me.
Would it make sense to make load_balance()->detach_tasks() ensure
busiest->sum_nr_running >= busiest->group_weight or so?