# Re: [PATCH 14/14] sched/topology: Add a few comments

**From:** Peter Zijlstra

**Date:** Mon May 01 2017 - 04:35:27 EST

On Fri, Apr 28, 2017 at 03:20:12PM +0200, Peter Zijlstra wrote:

```
> +/*
> + * NUMA topology (first read the regular topology blurb below)
> + *
> + * Given a node-distance table, for example:
> + *
> + *   node   0   1   2   3
> + *     0:  10  20  30  20
> + *     1:  20  10  20  30
> + *     2:  30  20  10  20
> + *     3:  20  30  20  10
> + *
> + * which represents a 4 node ring topology like:
> + *
> + *   0 ----- 1
> + *   |       |
> + *   |       |
> + *   |       |
> + *   3 ----- 2
> + *
> + * We want to construct domains and groups to represent this. The way we go
> + * about doing this is to build the domains on 'hops'. For each NUMA level we
> + * construct the mask of all nodes reachable in @level hops.
> + *
> + * For the above NUMA topology that gives 3 levels:
> + *
> + * NUMA-2       0-3             0-3             0-3             0-3
> + *  groups:     {0-1,3},{1-3}   {0-2},{0,2-3}   {1-3},{0-1,3}   {0,2-3},{0-2}
> + *
> + * NUMA-1       0-1,3           0-2             1-3             0,2-3
> + *  groups:     {0},{1},{3}     {0},{1},{2}     {1},{2},{3}     {0},{2},{3}
> + *
> + * NUMA-0       0               1               2               3
> + *
> + *
> + * As can be seen; things don't nicely line up as with the regular topology.
> + * When we iterate a domain in child domain chunks some nodes can be
> + * represented multiple times -- hence the "overlap" naming for this part of
> + * the topology.
> + *
> + * In order to minimize this overlap, we only build enough groups to cover the
> + * domain. For instance Node-0 NUMA-2 would only get groups: 0-1,3 and 1-3.
> + *
> + * Because:
> + *
> + *  - the first group of each domain is its child domain; this
> + *    gets us the first 0-1,3
> + *  - the only uncovered node is 2, whose child domain is 1-3.
> + *
> + * However, because of the overlap, computing a unique CPU for each group is
> + * more complicated. Consider for instance the groups of NODE-1 NUMA-2, both
> + * groups include the CPUs of Node-0, while those CPUs would not in fact ever
> + * end up at those groups (they would end up in group: 0-1,3).
> + *
> + * To correct this we have to introduce the group iteration mask. This mask
> + * will contain those CPUs in the group that can reach this group given the
> + * (child) domain tree.
> + *
> + * With this we can once again compute balance_cpu and sched_group_capacity
> + * relations.
> + *
> + * XXX include words on how balance_cpu is unique and therefore can be
> + * used for sched_group_capacity links.
> + */
> +
```

I added the below to clarify some of the asymmetric comments we have.

```
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -571,6 +571,37 @@ int group_balance_cpu(struct sched_group
  *
  * XXX include words on how balance_cpu is unique and therefore can be
  * used for sched_group_capacity links.
+ *
+ *
+ * Another 'interesting' topology is:
+ *
+ *   node   0   1   2   3
+ *     0:  10  20  20  30
+ *     1:  20  10  20  20
+ *     2:  20  20  10  20
+ *     3:  30  20  20  10
+ *
+ * Which looks a little like:
+ *
+ *   0 ----- 1
+ *   |     / |
+ *   |   /   |
+ *   | /     |
+ *   2 ----- 3
+ *
+ * This topology is asymmetric, nodes 1,2 are fully connected, but nodes 0,3
+ * are not.
+ *
+ * This leads to a few particularly weird cases where the number of
+ * sched_domains is not the same for each CPU. Consider:
+ *
+ * NUMA-2       0-3                             0-3
+ *  groups:     {0-2},{1-3}                     {1-3},{0-2}
+ *
+ * NUMA-1       0-2             0-3             0-3             1-3
+ *
+ * NUMA-0       0               1               2               3
+ *
  */
```