[tip:sched/core] sched/numa: Do not move past the balance point if unbalanced

From: tip-bot for Rik van Riel
Date: Wed Feb 18 2015 - 12:11:08 EST


Commit-ID: 095bebf61a460ad7f6a45bb17ddbf3a9df2b4397
Gitweb: http://git.kernel.org/tip/095bebf61a460ad7f6a45bb17ddbf3a9df2b4397
Author: Rik van Riel <riel@xxxxxxxxxx>
AuthorDate: Tue, 3 Feb 2015 16:56:48 -0500
Committer: Ingo Molnar <mingo@xxxxxxxxxx>
CommitDate: Wed, 18 Feb 2015 16:18:00 +0100

sched/numa: Do not move past the balance point if unbalanced

There is a subtle interaction between the logic introduced in commit
e63da03639cc ("sched/numa: Allow task switch if load imbalance improves"),
the way the load balancer counts the load on each NUMA node, and the way
NUMA hinting faults are done.

Specifically, the load balancer only counts currently running tasks
in the load, while NUMA hinting faults may cause tasks to stop, if
the page is locked by another task.

This could cause all of the threads of a large single instance workload,
like SPECjbb2005, to migrate to the same NUMA node. This was possible
because occasionally they all fault on the same few pages, and only one
of the threads remains runnable. That thread can move to the process's
preferred NUMA node without making the imbalance worse, because nothing
else is running at that time.

The fix is to check the direction of the net moving of load, and to
refuse a NUMA move if it would cause the system to move past the point
of balance. In an unbalanced state, only moves that bring us closer
to the balance point are allowed.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: mgorman@xxxxxxx
Link: http://lkml.kernel.org/r/20150203165648.0e9ac692@xxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
kernel/sched/fair.c | 41 ++++++++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ce18f3..28cbaca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1196,9 +1196,11 @@ static void task_numa_assign(struct task_numa_env *env,
static bool load_too_imbalanced(long src_load, long dst_load,
struct task_numa_env *env)
{
- long imb, old_imb;
- long orig_src_load, orig_dst_load;
long src_capacity, dst_capacity;
+ long orig_src_load;
+ long load_a, load_b;
+ long moved_load;
+ long imb;

/*
* The load is corrected for the CPU capacity available on each node.
@@ -1211,30 +1213,39 @@ static bool load_too_imbalanced(long src_load, long dst_load,
dst_capacity = env->dst_stats.compute_capacity;

/* We care about the slope of the imbalance, not the direction. */
- if (dst_load < src_load)
- swap(dst_load, src_load);
+ load_a = dst_load;
+ load_b = src_load;
+ if (load_a < load_b)
+ swap(load_a, load_b);

/* Is the difference below the threshold? */
- imb = dst_load * src_capacity * 100 -
- src_load * dst_capacity * env->imbalance_pct;
+ imb = load_a * src_capacity * 100 -
+ load_b * dst_capacity * env->imbalance_pct;
if (imb <= 0)
return false;

/*
* The imbalance is above the allowed threshold.
- * Compare it with the old imbalance.
+ * Allow a move that brings us closer to a balanced situation,
+ * without moving things past the point of balance.
*/
orig_src_load = env->src_stats.load;
- orig_dst_load = env->dst_stats.load;

- if (orig_dst_load < orig_src_load)
- swap(orig_dst_load, orig_src_load);
-
- old_imb = orig_dst_load * src_capacity * 100 -
- orig_src_load * dst_capacity * env->imbalance_pct;
+ /*
+ * In a task swap, there will be one load moving from src to dst,
+ * and another moving back. This is the net sum of both moves.
+ * A simple task move will always have a positive value.
+ * Allow the move if it brings the system closer to a balanced
+ * situation, without crossing over the balance point.
+ */
+ moved_load = orig_src_load - src_load;

- /* Would this change make things worse? */
- return (imb > old_imb);
+ if (moved_load > 0)
+ /* Moving src -> dst. Did we overshoot balance? */
+ return src_load * dst_capacity < dst_load * src_capacity;
+ else
+ /* Moving dst -> src. Did we overshoot balance? */
+ return dst_load * src_capacity < src_load * dst_capacity;
}

/*
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/