From: Peter Zijlstra
Date: Mon May 01 2017 - 04:35:27 EST

On Fri, Apr 28, 2017 at 03:20:12PM +0200, Peter Zijlstra wrote:
> +/*
> + * NUMA topology (first read the regular topology blurb below)
> + *
> + * Given a node-distance table, for example:
> + *
> + * node 0 1 2 3
> + * 0: 10 20 30 20
> + * 1: 20 10 20 30
> + * 2: 30 20 10 20
> + * 3: 20 30 20 10
> + *
> + * which represents a 4 node ring topology like:
> + *
> + * 0 ----- 1
> + * | |
> + * | |
> + * | |
> + * 3 ----- 2
> + *
> + * We want to construct domains and groups to represent this. The way we go
> + * about doing this is to build the domains on 'hops'. For each NUMA level we
> + * construct the mask of all nodes reachable in @level hops.
> + *
> + * For the above NUMA topology that gives 3 levels:
> + *
> + * NUMA-2 0-3 0-3 0-3 0-3
> + * groups: {0-1,3},{1-3} {0-2},{0,2-3} {1-3},{0-1,3} {0,2-3},{0-2}
> + *
> + * NUMA-1 0-1,3 0-2 1-3 0,2-3
> + * groups: {0},{1},{3} {0},{1},{2} {1},{2},{3} {0},{2},{3}
> + *
> + * NUMA-0 0 1 2 3
> + *
> + *
> + * As can be seen; things don't nicely line up as with the regular topology.
> + * When we iterate a domain in child domain chunks some nodes can be
> + * represented multiple times -- hence the "overlap" naming for this part of
> + * the topology.
> + *
> + * In order to minimize this overlap, we only build enough groups to cover the
> + * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
> + *
> + * Because:
> + *
> + * - the first group of each domain is its child domain; this
> + * gets us the first 0-1,3
> + * - the only uncovered node is 2, who's child domain is 1-3.
> + *
> + * However, because of the overlap, computing a unique CPU for each group is
> + * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both
> + * groups include the CPUs of Node-0, while those CPUs would not in fact ever
> + * end up at those groups (they would end up in group: 0-1,3).
> + *
> + * To correct this we have to introduce the group iteration mask. This mask
> + * will contain those CPUs in the group that can reach this group given the
> + * (child) domain tree.
> + *
> + * With this we can once again compute balance_cpu and sched_group_capacity
> + * relations.
> + *
> + * XXX include words on how balance_cpu is unique and therefore can be
> + * used for sched_group_capacity links.
> + */
> +

I added the below to clarify some of the asymmetric comments we have.

--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -571,6 +571,37 @@ int group_balance_cpu(struct sched_group
*
* XXX include words on how balance_cpu is unique and therefore can be
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ * node 0 1 2 3
+ * 0: 10 20 20 30
+ * 1: 20 10 20 20
+ * 2: 20 20 10 20
+ * 3: 30 20 20 10
+ *
+ * Which looks a little like:
+ *
+ * 0 ----- 1
+ * | / |
+ * | / |
+ * | / |
+ * 2 ----- 3
+ *
+ * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
+ * This leads to a few particularly weird cases where the sched_domain's are
+ * not of the same number for each cpu. Consider:
+ *
+ * NUMA-2 0-3 0-3
+ * groups: {0-2},{1-3} {1-3},{0-2}
+ *
+ * NUMA-1 0-2 0-3 0-3 1-3
+ *
+ * NUMA-0 0 1 2 3
+ *
*/