[RFC][PATCH 5/6] x86/topo: Fix SNC topology mess

From: Peter Zijlstra

Date: Thu Feb 26 2026 - 05:57:06 EST


So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for
GNR, CWF in SNC-3 mode"), the original crazy SNC-3 SLIT table was:

node distances:
node   0  1  2  3  4  5
   0: 10 15 17 21 28 26
   1: 15 10 15 23 26 23
   2: 17 15 10 26 23 21
   3: 21 28 26 10 15 17
   4: 23 26 23 15 10 15
   5: 26 23 21 17 15 10

And per:

https://lore.kernel.org/lkml/20250825075642.GQ3245006@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

My suggestion was to average the off-trace clusters to restore sanity.

However, 4d6dd05d07d0 implements this under various assumptions:

- there will never be more than 2 packages;
- the off-trace cluster will have distance > 20.

And then HPE shows up with a machine that matches the
Vendor-Family-Model checks but looks like this:

Here's an 8 socket (2 chassis) HPE system with SNC enabled:

node   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
   0: 10 12 16 16 16 16 18 18 40 40 40 40 40 40 40 40
   1: 12 10 16 16 16 16 18 18 40 40 40 40 40 40 40 40
   2: 16 16 10 12 18 18 16 16 40 40 40 40 40 40 40 40
   3: 16 16 12 10 18 18 16 16 40 40 40 40 40 40 40 40
   4: 16 16 18 18 10 12 16 16 40 40 40 40 40 40 40 40
   5: 16 16 18 18 12 10 16 16 40 40 40 40 40 40 40 40
   6: 18 18 16 16 16 16 10 12 40 40 40 40 40 40 40 40
   7: 18 18 16 16 16 16 12 10 40 40 40 40 40 40 40 40
   8: 40 40 40 40 40 40 40 40 10 12 16 16 16 16 18 18
   9: 40 40 40 40 40 40 40 40 12 10 16 16 16 16 18 18
  10: 40 40 40 40 40 40 40 40 16 16 10 12 18 18 16 16
  11: 40 40 40 40 40 40 40 40 16 16 12 10 18 18 16 16
  12: 40 40 40 40 40 40 40 40 16 16 18 18 10 12 16 16
  13: 40 40 40 40 40 40 40 40 16 16 18 18 12 10 16 16
  14: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 10 12
  15: 40 40 40 40 40 40 40 40 18 18 16 16 16 16 12 10

10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis

*However* this is SNC-2.

This completely invalidates all the earlier assumptions and trips
WARNs.

Now that the topology code has a sensible measure of
nodes-per-package, we can use that to divine the SNC mode at hand,
and only fix up SNC-3 topologies.

The only remaining assumption is that there are no CPU-less nodes --
is this a valid assumption?

Fixes: 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
---
arch/x86/kernel/smpboot.c | 64 +++++++++++++++++-----------------------------
1 file changed, 25 insertions(+), 39 deletions(-)

--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -506,33 +506,32 @@ static void __init build_sched_topology(
 }
 
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+static int slit_cluster_distance(int i, int j)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
-
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
-
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
-			}
+	int u = __num_nodes_per_package;
+	long d = 0;
+	int x, y;
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u))
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster, return average of the cluster to force symmetry.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++) {
+			d += node_distance(i, j);
+			d += node_distance(j, i);
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
 
-	return sched_avg_remote_distance;
+	return d / (2*u*u);
 }
 
 int arch_sched_node_distance(int from, int to)
@@ -542,13 +541,11 @@ int arch_sched_node_distance(int from, i
 	switch (boot_cpu_data.x86_vfm) {
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
-
-		if (topology_max_packages() == 1 || __num_nodes_per_package == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (topology_max_packages() == 1 || __num_nodes_per_package < 3)
 			return d;
 
 		/*
-		 * With SNC enabled, there could be too many levels of remote
+		 * With SNC-3 enabled, there could be too many levels of remote
 		 * NUMA node distances, creating NUMA domain levels
 		 * including local nodes and partial remote nodes.
 		 *
@@ -557,19 +554,8 @@ int arch_sched_node_distance(int from, i
 		 * in the remote package in the same sched group.
 		 * Simplify NUMA domains and avoid extra NUMA levels including
 		 * different remote NUMA nodes and local nodes.
-		 *
-		 * GNR and CWF don't expect systems with more than 2 packages
-		 * and more than 2 hops between packages. Single average remote
-		 * distance won't be appropriate if there are more than 2
-		 * packages as average distance to different remote packages
-		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }