Re: [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing

From: Dietmar Eggemann
Date: Tue Mar 24 2015 - 14:48:01 EST


On 24/03/15 15:26, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:21PM +0000, Morten Rasmussen wrote:
> > From: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> >
> > Energy-aware load balancing bases on cpu usage so the upper bound of its
> > operational range is a fully utilized cpu. Above this tipping point it
> > makes more sense to use weighted_cpuload to preserve smp_nice.
> > This patch implements the tipping point detection in update_sg_lb_stats
> > as if one cpu is over-utilized the current energy-aware load balance
> > operation will fall back into the conventional weighted load based one.
> >
> > cc: Ingo Molnar <mingo@xxxxxxxxxx>
> > cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> >
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> > ---
> >  kernel/sched/fair.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6b79603..4849bad 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6723,6 +6723,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >  		sgs->sum_weighted_load += weighted_cpuload(i);
> >  		if (idle_cpu(i))
> >  			sgs->idle_cpus++;
> > +
> > +		/* If cpu is over-utilized, bail out of ea */
> > +		if (env->use_ea && cpu_overutilized(i, env->sd))
> > +			env->use_ea = false;
> >  	}

> I don't immediately see why this is desired. Why would a single
> overloaded CPU be reason to quit? It could be the cpus simply aren't
> 'balanced' right and the group as a whole is still under utilized.

We want to play it safe here.

E.g. in a system with more than two clusters, this over-utilized cpu
could be running more than one high-priority task on a cluster of
energy-efficient cpus, and that cluster could still fail to become the
lb src at DIE level because a cluster of less energy-efficient cpus
(burning more energy) that is not over-utilized could be chosen instead.
We could also construct cases where the other cpus in the
energy-efficient cluster can't help the over-utilized cpu during lb at
MC level.

I can see, though, that using per-cpu data in code which deals w/ sg's
goes against the sd scalability design, where we should rely on per-sg
and not per-cpu data.

By bailing out in such a scenario we at least guarantee smpnice provided by conv. CFS.
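
To make the bail-out concrete, here is a minimal user-space sketch of the
tipping point check. It is not the kernel code: struct cpu_stat,
cpu_overutilized_sketch() and update_use_ea() are hypothetical stand-ins,
and treating "usage >= capacity" as over-utilized is an assumption made
only for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-cpu data; not the kernel's rq or sched_group structures. */
struct cpu_stat {
	unsigned long usage;	/* current cpu usage */
	unsigned long capacity;	/* capacity available to tasks */
};

/* Assumed threshold: a cpu counts as over-utilized once it is fully used. */
static bool cpu_overutilized_sketch(const struct cpu_stat *c)
{
	return c->usage >= c->capacity;
}

/*
 * Walk the cpus of one group, as the stats loop does, and clear use_ea
 * as soon as a single cpu is over-utilized. The flag is never set back
 * to true here, so one over-utilized cpu is enough to fall back to
 * conventional load balancing for the rest of the pass.
 */
static bool update_use_ea(const struct cpu_stat *cpus, size_t nr, bool use_ea)
{
	for (size_t i = 0; i < nr; i++) {
		if (use_ea && cpu_overutilized_sketch(&cpus[i]))
			use_ea = false;
	}
	return use_ea;
}
```

Note that the group as a whole can be far from full and the flag still
drops, which is exactly the per-cpu vs. per-sg trade-off discussed above.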

We could also favor an sg with an over-utilized cpu to become the src,
but which one do we pick if there are multiple potential src sg's w/ an
over-utilized cpu?


> In that case we want to continue the balance pass to reach this
> equilibrium.

