Re: [PATCH 3/3] sched,numa: do not set preferred_node on migration to a second choice node

From: Rik van Riel
Date: Tue Apr 15 2014 - 10:36:08 EST


On 04/14/2014 08:56 AM, Peter Zijlstra wrote:
On Fri, Apr 11, 2014 at 01:00:29PM -0400, riel@xxxxxxxxxx wrote:
From: Rik van Riel <riel@xxxxxxxxxx>

Setting the numa_preferred_node for a task in task_numa_migrate
does nothing on a 2-node system. Either we migrate to the node
that already was our preferred node, or we stay where we were.

On a 4-node system, it can slightly decrease overhead, by not
calling the NUMA code as much. Since every node tends to be
directly connected to every other node, running on the wrong
node for a while does not do much damage.

However, on an 8 node system, there are far more bad nodes
than there are good ones, and pretending that a second choice
is actually the preferred node can greatly delay, or even
prevent, a workload from converging.

The only time we can safely pretend that a second choice
node is the preferred node is when the task is part of a
workload that spans multiple NUMA nodes.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
Tested-by: Vinod Chegu <chegu_vinod@xxxxxx>
---
kernel/sched/fair.c | 11 ++++++++++-
1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index babd316..302facf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1301,7 +1301,16 @@ static int task_numa_migrate(struct task_struct *p)
if (env.best_cpu == -1)
return -EAGAIN;

- sched_setnuma(p, env.dst_nid);
+ /*
+ * If the task is part of a workload that spans multiple NUMA nodes,
+ * and is migrating into one of the workload's active nodes, remember

I read 'into' as:
!node_isset(env.src_nid, ...) && node_isset(env.dst_nid, ...)

The code doesn't seem to do this.

s/into/to/ makes the comment and the code match
again :)

+ * this node as the task's preferred numa node, so the workload can
+ * settle down.
+ * A task that migrated to a second choice node will be better off
+ * trying for a better one later. Do not set the preferred node here.
+ */
+ if (p->numa_group && node_isset(env.dst_nid, p->numa_group->active_nodes))
+ sched_setnuma(p, env.dst_nid);

OK, so I was totally confused on this one.

What I missed was that we set the primary choice over in
task_numa_placement().

I'm not really happy with the changelog; but I'm also struggling to
identify what exactly is missing. Or rather, the thing makes me
confused, and not feel like it actually explains it proper.

That said; I tend to more or less agree with the actual change, but..

I have looked at the comment and the changelog some
more, and it is not clear to me what you are missing,
or what I could be explaining better...


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/