[PATCH 1/6] sched/numa: Stop multiple tasks from moving to the same CPU at the same time

From: Srikar Dronamraju
Date: Fri Aug 03 2018 - 02:14:53 EST


Task migration under NUMA balancing can happen in parallel. More than
one task might choose to migrate to the same CPU at the same time. This
can result in:
- During a task swap, choosing a task that was not part of the evaluation.
- During a task swap, a task which just got moved to its preferred node
  being moved to a completely different node.
- During a task swap, a task failing to move to its preferred node having
  to wait an extra interval for the next migration opportunity.
- During task movement, multiple concurrent movements causing a load
  imbalance.

This problem is more likely if there are more cores per node or more
nodes in the system.

Use a per-run-queue variable to check whether NUMA balancing is already
active on the run-queue.
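
For reference, the pattern boils down to claiming the destination
run-queue with xchg() before a migration is queued, and releasing it
with WRITE_ONCE() once the move or swap has been issued. The following
is a minimal userspace sketch of that claim/release protocol, with C11
atomics standing in for the kernel's xchg()/WRITE_ONCE(); the struct rq
here and the names claim_dst_rq()/release_dst_rq() are illustrative
stand-ins, not part of the patch:

/* Userspace sketch of the per-rq claim/release this patch introduces. */
#include <stdatomic.h>
#include <stdio.h>

struct rq {
	atomic_uint numa_migrate_on;	/* 1 while a NUMA migration targets this rq */
};

/* Claim @dst as a migration destination; only the first claimant wins. */
static int claim_dst_rq(struct rq *dst)
{
	return atomic_exchange(&dst->numa_migrate_on, 1) == 0;
}

/* Release the claim once the migration or swap has been issued. */
static void release_dst_rq(struct rq *dst)
{
	atomic_store(&dst->numa_migrate_on, 0);
}

int main(void)
{
	struct rq dst = { 0 };

	if (claim_dst_rq(&dst))
		printf("claimed dst rq, proceed with move/swap\n");

	/* A concurrent claimant loses the race and must pick another destination. */
	if (!claim_dst_rq(&dst))
		printf("dst rq busy, bail out\n");

	release_dst_rq(&dst);	/* winner clears the flag once done */
	return 0;
}

In the diff below, task_numa_assign() performs the claim (and hands it
over from a previously chosen best_cpu), while task_numa_migrate()
clears the flag after migrate_task_to()/migrate_swap().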

specjbb2005 / bops/JVM / higher bops are better
on 2 Socket/2 Node Intel
JVMS  Prev      Current   %Change
4     199709    206350     3.32534
1     330830    319963    -3.28477


on 2 Socket/4 Node Power8 (PowerNV)
JVMS  Prev      Current   %Change
8     89011.9   89627.8    0.69193
1     218946    211338    -3.47483


on 2 Socket/2 Node Power9 (PowerNV)
JVMS  Prev      Current   %Change
4     180473    186539     3.36117
1     212805    220344     3.54268


on 4 Socket/4 Node Power7
JVMS  Prev      Current   %Change
8     56941.8   56836     -0.185804
1     111686    112970     1.14965


dbench / transactions / higher numbers are better
on 2 Socket/2 Node Intel
         count  Min       Max       Avg       Variance   %Change
Prev     5      12029.8   12124.6   12060.9   34.0076
Current  5      13136.1   13170.2   13150.2   14.7482     9.03166


on 2 Socket/4 Node Power8 (PowerNV)
         count  Min       Max       Avg       Variance   %Change
Prev     5      4968.51   5006.62   4981.31   13.4151
Current  5      4319.79   4998.19   4836.53   261.109    -2.90646


on 2 Socket/2 Node Power9 (PowerNV)
         count  Min       Max       Avg       Variance   %Change
Prev     5      9342.92   9381.44   9363.92   12.8587
Current  5      9325.56   9402.7    9362.49   25.9638    -0.0152714


on 4 Socket/4 Node Power7
         count  Min       Max       Avg       Variance   %Change
Prev     5      143.4     188.892   170.225   16.9929
Current  5      132.581   191.072   170.554   21.6444     0.193274

Acked-by: Mel Gorman <mgorman@xxxxxxxxxxxxxxxxxxx>
Reviewed-by: Rik van Riel <riel@xxxxxxxxxxx>
Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
---
Changelog v2->v3:
Add comments as requested by Peter.

kernel/sched/fair.c | 22 ++++++++++++++++++++++
kernel/sched/sched.h | 1 +
2 files changed, 23 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 309c93f..5cf921a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1514,6 +1514,21 @@ struct task_numa_env {
 static void task_numa_assign(struct task_numa_env *env,
			     struct task_struct *p, long imp)
 {
+	struct rq *rq = cpu_rq(env->dst_cpu);
+
+	/* Bail out if run-queue part of active numa balance. */
+	if (xchg(&rq->numa_migrate_on, 1))
+		return;
+
+	/*
+	 * Clear previous best_cpu/rq numa-migrate flag, since task now
+	 * found a better cpu to move/swap.
+	 */
+	if (env->best_cpu != -1) {
+		rq = cpu_rq(env->best_cpu);
+		WRITE_ONCE(rq->numa_migrate_on, 0);
+	}
+
 	if (env->best_task)
 		put_task_struct(env->best_task);
 	if (p)
@@ -1569,6 +1584,9 @@ static void task_numa_compare(struct task_numa_env *env,
 	long moveimp = imp;
 	int dist = env->dist;
 
+	if (READ_ONCE(dst_rq->numa_migrate_on))
+		return;
+
 	rcu_read_lock();
 	cur = task_rcu_dereference(&dst_rq->curr);
 	if (cur && ((cur->flags & PF_EXITING) || is_idle_task(cur)))
@@ -1710,6 +1728,7 @@ static int task_numa_migrate(struct task_struct *p)
 		.best_cpu = -1,
 	};
 	struct sched_domain *sd;
+	struct rq *best_rq;
 	unsigned long taskweight, groupweight;
 	int nid, ret, dist;
 	long taskimp, groupimp;
@@ -1811,14 +1830,17 @@ static int task_numa_migrate(struct task_struct *p)
 	 */
 	p->numa_scan_period = task_scan_start(p);
 
+	best_rq = cpu_rq(env.best_cpu);
 	if (env.best_task == NULL) {
 		ret = migrate_task_to(p, env.best_cpu);
+		WRITE_ONCE(best_rq->numa_migrate_on, 0);
 		if (ret != 0)
 			trace_sched_stick_numa(p, env.src_cpu, env.best_cpu);
 		return ret;
 	}
 
 	ret = migrate_swap(p, env.best_task, env.best_cpu, env.src_cpu);
+	WRITE_ONCE(best_rq->numa_migrate_on, 0);
 
 	if (ret != 0)
 		trace_sched_stick_numa(p, env.src_cpu, task_cpu(env.best_task));
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4a2e8ca..0b91612 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -783,6 +783,7 @@ struct rq {
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int		nr_numa_running;
 	unsigned int		nr_preferred_running;
+	unsigned int		numa_migrate_on;
 #endif
 #define CPU_LOAD_IDX_MAX 5
 	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
--
1.8.3.1