Re: [PATCH v2] sched/topology: Check average distances to remote packages
From: Peter Zijlstra
Date: Wed Feb 25 2026 - 07:34:53 EST
On Tue, Feb 24, 2026 at 07:43:10PM -0600, Kyle Meyer wrote:
> Here's an 8 socket (2 chassis) HPE system with SNC enabled:
>
> node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
>   0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>   1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
>   2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
>   3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
>   4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
>   5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
>   6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
>   7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
>   8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
>   9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
>  10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
>  11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
>  12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
>  13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
>  14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
>  15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
>
> 10 = Same chassis and socket
> 12 = Same chassis and socket (SNC)
> 16 = Same chassis and adjacent socket
> 18 = Same chassis and non-adjacent socket
> 40 = Different chassis
>
> Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
> the UPI interconnect across the entire system.
>
> We don't experience the scheduler domain issue reported by Tim because our SLIT
> provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
> because we exceed 2 packages.

The original case was for SNC-3; the above looks to be SNC-2. Does your
system also support SNC-3?

Anyway, yes, your SLIT table looks sane (unlike that SNC-3 monster Tim
showed earlier).

And it also shows that using REMOTE_DISTANCE (20) was completely random
and 'wrong'.

So per 4d6dd05d07d0 ("sched/topology: Fix sched domain build error for GNR, CWF in SNC-3 mode")
Tim's original crazy SNC-3 SLIT table was:

node distances:
node   0   1   2   3   4   5
  0:  10  15  17  21  28  26
  1:  15  10  15  23  26  23
  2:  17  15  10  26  23  21
  3:  21  28  26  10  15  17
  4:  23  26  23  15  10  15
  5:  26  23  21  17  15  10

And per:

  https://lore.kernel.org/lkml/20250825075642.GQ3245006@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/

My suggestion was to average the off-trace clusters to restore sanity.

So how about we implement that without reference to magic numbers,
something like so. This obviously needs a little TLC, but it might
just work.

Hmm?
---
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 5cd6950ab672..cba3e4b14250 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -513,33 +513,55 @@ static void __init build_sched_topology(void)
 }
 #ifdef CONFIG_NUMA
-static int sched_avg_remote_distance;
-static int avg_remote_numa_distance(void)
+
+/*
+ * Find the largest symmetric cluster in an attempt to identify the unit size.
+ *
+ * XXX doesn't respect N_CPU node classes and such.
+ */
+static int slit_cluster_size(void)
 {
-	int i, j;
-	int distance, nr_remote, total_distance;
+	int i, j, n, m = num_possible_nodes();
-	if (sched_avg_remote_distance > 0)
-		return sched_avg_remote_distance;
-
-	nr_remote = 0;
-	total_distance = 0;
-	for_each_node_state(i, N_CPU) {
-		for_each_node_state(j, N_CPU) {
-			distance = node_distance(i, j);
-
-			if (distance >= REMOTE_DISTANCE) {
-				nr_remote++;
-				total_distance += distance;
+	for (n = 2; n < m; n++) {
+		for (i = 0; i < n; i++) {
+			for (j = i; j < n; j++) {
+				if (node_distance(i, j) != node_distance(j, i))
+					return n - 1;
 			}
 		}
 	}
-	if (nr_remote)
-		sched_avg_remote_distance = total_distance / nr_remote;
-	else
-		sched_avg_remote_distance = REMOTE_DISTANCE;
-	return sched_avg_remote_distance;
+	return m;
+}
+
+static int slit_cluster_distance(int i, int j)
+{
+	static int u = 0;
+	long d = 0;
+	int x, y;
+
+	if (!u)
+		u = slit_cluster_size();
+
+	/*
+	 * Is this a unit cluster on the trace?
+	 */
+	if ((i / u) == (j / u))
+		return node_distance(i, j);
+
+	/*
+	 * Off-trace cluster, return average of the cluster to force symmetry.
+	 */
+	x = i - (i % u);
+	y = j - (j % u);
+
+	for (i = x; i < x + u; i++) {
+		for (j = y; j < y + u; j++)
+			d += node_distance(i, j);
+	}
+
+	return d / (u*u);
 }
 int arch_sched_node_distance(int from, int to)
@@ -550,8 +572,7 @@ int arch_sched_node_distance(int from, int to)
 	case INTEL_GRANITERAPIDS_X:
 	case INTEL_ATOM_DARKMONT_X:
-		if (!x86_has_numa_in_package || topology_max_packages() == 1 ||
-		    d < REMOTE_DISTANCE)
+		if (!x86_has_numa_in_package || topology_max_packages() == 1)
 			return d;
 		/*
@@ -571,12 +592,7 @@ int arch_sched_node_distance(int from, int to)
 		 * packages as average distance to different remote packages
 		 * could be different.
 		 */
-		WARN_ONCE(topology_max_packages() > 2,
-			  "sched: Expect only up to 2 packages for GNR or CWF, "
-			  "but saw %d packages when building sched domains.",
-			  topology_max_packages());
-
-		d = avg_remote_numa_distance();
+		return slit_cluster_distance(from, to);
 	}
 	return d;
 }