Re: [PATCH 3/7] sched/fair: Correct unit of load_above_capacity

From: Morten Rasmussen
Date: Tue May 03 2016 - 10:56:34 EST


On Tue, May 03, 2016 at 12:52:33PM +0200, Peter Zijlstra wrote:
> On Fri, Apr 29, 2016 at 08:32:40PM +0100, Dietmar Eggemann wrote:
> > From: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> >
> > In calculate_imbalance() load_above_capacity currently has the unit
> > [load] while it is used as being [load/capacity]. Not only is it wrong it
> > also makes it unlikely that load_above_capacity is ever used as the
> > subsequent code picks the smaller of load_above_capacity and the avg_load
> > difference between the local and the busiest sched_groups.
> >
> > This patch ensures that load_above_capacity has the right unit
> > [load/capacity].
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > ---
> > kernel/sched/fair.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bc19fee94f39..c066574cff04 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7016,9 +7016,11 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> > local->group_type == group_overloaded) {
> > load_above_capacity = busiest->sum_nr_running *
> > SCHED_LOAD_SCALE;
>
> Given smpnice and cgroups; this starting point seems somewhat random.
> Then again, without cgroup and assuming all tasks are nice-0 it is still
> true. But I think the cgroup muck is rather universal these days.

It makes sense for non-grouped always-running nice-0 tasks, and not so
much for all other scenarios. We ignore smpnice, PELT, and cgroups :-(

>
> The above SCHED_LOAD_SCALE is the 1 -> load metric, right?

Yes. For clarity I would probably have used NICE_O_LOAD instead. We are
establishing how much capacity sum_nr_running task would consume if they
are all non-grouped always-running nice-0 tasks.

>
> > - if (load_above_capacity > busiest->group_capacity)
> > + if (load_above_capacity > busiest->group_capacity) {
> > load_above_capacity -= busiest->group_capacity;

If they would consume more than we have, we return how much more
otherwise LONG_MAX.

> > - else
> > + load_above_capacity *= SCHED_LOAD_SCALE;
> > + load_above_capacity /= busiest->group_capacity;
>
> And this one does load->capacity

It does load->[load/capacity]

The subsequent code does [load/capacity]->load:

env->imbalance = min(
--> max_pull * busiest->group_capacity,
(sds->avg_load - local->avg_load) * local->group_capacity
) / SCHED_CAPACITY_SCALE;

>
> > + } else
> > load_above_capacity = ~0UL;
> > }
>
>
> Is there anything better we can do? I know there's a fair bit of cruft
> in; but these assumptions that SCHED_LOAD_SCALE is one task, just bug
> the hell out of me.
>
> Would it make sense to make load_balance()->detach_tasks() ensure
> busiest->sum_nr_running >= busiest->group_weight or so?

That would make a lot of sense, because that is essentially what the
code does for non-SMT systems. The story is a bit different for SMT
though since:

busiest->group_capacity < busiest->group_weight * SCHED_LOAD_SCALE.

This could lead us to try pulling more tasks to vacate SMT hw-threads.
However, I'm not convinced if it has any effect as:

max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);

is the _minimum_ of the actual load difference and the 'excess' load, so
load_above_capacity can't cause us to pull more load anyway. So I don't
see why it shouldn't be okay.

Something like the below should implement your proposal I think (only
boot tested on arm64):

---8<---

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fe30e6..37fcbec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5777,6 +5777,7 @@ struct lb_env {
int new_dst_cpu;
enum cpu_idle_type idle;
long imbalance;
+ long imbalance_tasks;
/* The set of CPUs under consideration for load-balancing */
struct cpumask *cpus;

@@ -6018,7 +6019,7 @@ static int detach_tasks(struct lb_env *env)

lockdep_assert_held(&env->src_rq->lock);

- if (env->imbalance <= 0)
+ if (env->imbalance <= 0 || env->imbalance_tasks == 0)
return 0;

while (!list_empty(tasks)) {
@@ -6072,9 +6073,9 @@ static int detach_tasks(struct lb_env *env)

/*
* We only want to steal up to the prescribed amount of
- * weighted load.
+ * weighted load or max number of tasks.
*/
- if (env->imbalance <= 0)
+ if (env->imbalance <= 0 || env->imbalance_tasks <= detached)
break;

continue;
@@ -6873,12 +6874,14 @@ void fix_small_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
*/
static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *sds)
{
- unsigned long max_pull, load_above_capacity = ~0UL;
+ unsigned long max_pull;
struct sg_lb_stats *local, *busiest;

local = &sds->local_stat;
busiest = &sds->busiest_stat;

+ env->imbalance_tasks = ~0UL;
+
if (busiest->group_type == group_imbalanced) {
/*
* In the group_imb case we cannot rely on group-wide averages
@@ -6904,23 +6907,17 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
*/
if (busiest->group_type == group_overloaded &&
local->group_type == group_overloaded) {
- load_above_capacity = busiest->sum_nr_running *
- SCHED_LOAD_SCALE;
- if (load_above_capacity > busiest->group_capacity)
- load_above_capacity -= busiest->group_capacity;
- else
- load_above_capacity = ~0UL;
+ if (busiest->sum_nr_running > busiest->group_weight)
+ env->imbalance_tasks =
+ busiest->sum_nr_running - busiest->group_weight;
}

/*
* We're trying to get all the cpus to the average_load, so we don't
* want to push ourselves above the average load, nor do we wish to
- * reduce the max loaded cpu below the average load. At the same time,
- * we also don't want to reduce the group load below the group capacity
- * (so that we can implement power-savings policies etc). Thus we look
- * for the minimum possible imbalance.
+ * reduce the max loaded cpu below the average load.
*/
- max_pull = min(busiest->avg_load - sds->avg_load, load_above_capacity);
+ max_pull = busiest->avg_load - sds->avg_load;

/* How much load to actually move to equalise the imbalance */
env->imbalance = min(
--
1.9.1