[tip:sched/core] sched/numa: Only consider less busy nodes as numa balancing destinations

From: tip-bot for Rik van Riel
Date: Sun Jun 07 2015 - 13:48:53 EST


Commit-ID: 6f9aad0bc37286c0441b57f0ba8cffee50715426
Gitweb: http://git.kernel.org/tip/6f9aad0bc37286c0441b57f0ba8cffee50715426
Author: Rik van Riel <riel@xxxxxxxxxx>
AuthorDate: Thu, 28 May 2015 09:52:49 -0400
Committer: Ingo Molnar <mingo@xxxxxxxxxx>
CommitDate: Sun, 7 Jun 2015 15:57:45 +0200

sched/numa: Only consider less busy nodes as numa balancing destinations

Changeset a43455a1d572 ("sched/numa: Ensure task_numa_migrate() checks
the preferred node") fixes an issue where workloads would never
converge on a fully loaded (or overloaded) system.

However, it introduces a regression on less than fully loaded systems,
where workloads converge on a few NUMA nodes, instead of properly
staying spread out across the whole system. This leads to a reduction
in available memory bandwidth, and usable CPU cache, with predictable
performance problems.

The root cause appears to be an interaction between the load balancer
and NUMA balancing, where the short term load represented by the load
balancer differs from the long term load the NUMA balancing code would
like to base its decisions on.

Simply reverting a43455a1d572 would re-introduce the non-convergence
of workloads on fully loaded systems, so that is not a good option. As
an aside, the check done before a43455a1d572 only applied to a task's
preferred node, not to other candidate nodes in the system, so the
converge-on-too-few-nodes problem still happens, just to a lesser
degree.

Instead, try to compensate for the impedance mismatch between the load
balancer and NUMA balancing by only ever considering a lesser loaded
node as a destination for NUMA balancing, regardless of whether the
task is trying to move to the preferred node, or to another node.

This patch also addresses the issue that a system with a single
runnable thread would never migrate that thread to near its memory,
introduced by 095bebf61a46 ("sched/numa: Do not move past the balance
point if unbalanced").

A test where the main thread creates a large memory area, and spawns a
worker thread to iterate over the memory (placed on another node by
select_task_rq_fair), after which the main thread goes to sleep and
waits for the worker thread to loop over all the memory now sees the
worker thread migrated to where the memory is, instead of having all
the memory migrated over like before.

Jirka has run a number of performance tests on several systems: single
instance SpecJBB 2005 performance is 7-15% higher on a 4 node system,
with higher gains on systems with more cores per socket.
Multi-instance SpecJBB 2005 (one per node), linpack, and stream see
little or no changes with the revert of 095bebf61a46 and this patch.

Reported-by: Artem Bityutski <dedekind1@xxxxxxxxx>
Reported-by: Jirka Hladky <jhladky@xxxxxxxxxx>
Tested-by: Jirka Hladky <jhladky@xxxxxxxxxx>
Tested-by: Artem Bityutskiy <dedekind1@xxxxxxxxx>
Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Acked-by: Mel Gorman <mgorman@xxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: H. Peter Anvin <hpa@xxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Link: http://lkml.kernel.org/r/20150528095249.3083ade0@xxxxxxxxxxxxxxxxxxxxx
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
kernel/sched/fair.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 723d69e..4b6e5f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1398,6 +1398,30 @@ static void task_numa_find_cpu(struct task_numa_env *env,
}
}

+/* Only move tasks to a NUMA node less busy than the current node. */
+static bool numa_has_capacity(struct task_numa_env *env)
+{
+ struct numa_stats *src = &env->src_stats;
+ struct numa_stats *dst = &env->dst_stats;
+
+ if (src->has_free_capacity && !dst->has_free_capacity)
+ return false;
+
+ /*
+ * Only consider a task move if the source has a higher load
+ * than the destination, corrected for CPU capacity on each node.
+ *
+ * src->load dst->load
+ * --------------------- vs ---------------------
+ * src->compute_capacity dst->compute_capacity
+ */
+ if (src->load * dst->compute_capacity >
+ dst->load * src->compute_capacity)
+ return true;
+
+ return false;
+}
+
static int task_numa_migrate(struct task_struct *p)
{
struct task_numa_env env = {
@@ -1452,7 +1476,8 @@ static int task_numa_migrate(struct task_struct *p)
update_numa_stats(&env.dst_stats, env.dst_nid);

/* Try to find a spot on the preferred nid. */
- task_numa_find_cpu(&env, taskimp, groupimp);
+ if (numa_has_capacity(&env))
+ task_numa_find_cpu(&env, taskimp, groupimp);

/*
* Look at other nodes in these cases:
@@ -1483,7 +1508,8 @@ static int task_numa_migrate(struct task_struct *p)
env.dist = dist;
env.dst_nid = nid;
update_numa_stats(&env.dst_stats, env.dst_nid);
- task_numa_find_cpu(&env, taskimp, groupimp);
+ if (numa_has_capacity(&env))
+ task_numa_find_cpu(&env, taskimp, groupimp);
}
}

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/