Re: [PATCH RFC 4/5] sched,numa: calculate node scores in complex NUMA topologies

From: Rik van Riel
Date: Mon Oct 13 2014 - 03:15:56 EST


On 10/12/2014 10:53 AM, Peter Zijlstra wrote:
On Wed, Oct 08, 2014 at 03:37:29PM -0400, riel@xxxxxxxxxx wrote:
From: Rik van Riel <riel@xxxxxxxxxx>

In order to do task placement on systems with complex NUMA topologies,
it is necessary to count the faults on nodes nearby the node that is
being examined for a potential move.

In case of a system with a backplane interconnect, we are dealing with
groups of NUMA nodes; each of the nodes within a group is the same number
of hops away from nodes in other groups in the system. Optimal placement
on this topology is achieved by counting all nearby nodes equally. When
comparing nodes A and B at distance N, nearby nodes are those at distances
smaller than N from nodes A or B.

Placement strategy on a system with a glueless mesh NUMA topology needs
to be different, because there are no natural groups of nodes determined
by the hardware. Instead, when dealing with two nodes A and B at distance
N, N >= 2, there will be intermediate nodes at distance < N from both nodes
A and B. Good placement can be achieved by right shifting the faults on
nearby nodes by the number of hops from the node being scored. In this
context, a nearby node is any node less than the maximum distance in the
system away from the node. Those nodes are skipped for efficiency reasons,
there is no real policy reason to do so.


+/* Handle placement on systems where not all nodes are directly connected. */
+static unsigned long score_nearby_nodes(struct task_struct *p, int nid,
+ int hoplimit, bool task)
+{
+ unsigned long score = 0;
+ int node;
+
+ /*
+ * All nodes are directly connected, and the same distance
+ * from each other. No need for fancy placement algorithms.
+ */
+ if (sched_numa_topology_type == NUMA_DIRECT)
+ return 0;
+
+ for_each_online_node(node) {

+ }
+
+ return score;
+}

@@ -944,6 +1003,8 @@ static inline unsigned long task_weight(struct task_struct *p, int nid,
return 0;

faults = task_faults(p, nid);
+ faults += score_nearby_nodes(p, nid, hops, true);
+
return 1000 * faults / total_faults;
}

@@ -961,6 +1022,8 @@ static inline unsigned long group_weight(struct task_struct *p, int nid,
return 0;

faults = group_faults(p, nid);
+ faults += score_nearby_nodes(p, nid, hops, false);
+
return 1000 * faults / total_faults;
}

So this makes {task,group}_weight() O(nr_nodes), and we call these
function from O(nr_nodes) loops, giving a total of O(nr_nodes^2)
computational complexity, right?

Seems important to mention; I realize this is only for !DIRECT, but
still, I bet the real large people (those same 512 nodes we had
previous) would not really appreciate this.

If desired, we can set up a nodemask for each node that
covers only the nodes that are less than the maximum
distance away from each other.

Even on the very large systems, that is likely to be
less than a dozen nodes.

I am not sure whether to add that complexity now, or
whether that should be considered premature optimization :)

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/