Re: [PATCH v2] sched/topology: Check average distances to remote packages
From: Chen, Yu C
Date: Wed Feb 25 2026 - 04:06:30 EST
Hi Kyle,
On 2/25/2026 9:43 AM, Kyle Meyer wrote:
On Mon, Feb 23, 2026 at 06:03:14PM +0100, Peter Zijlstra wrote:
On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
remote packages to fix scheduler domains, see [1] for more information.
A warning and backtrace are printed when sub-NUMA clustering (SNC) is
enabled and there are more than 2 packages because the average distances
to remote packages could be different, skewing the single average remote
distance.
But earlier Tim said these systems will not have more than 2 packages.
So what's what?
We have Intel customer reference boards with 2, 4, and 8 sockets.
Thanks for the info. We were not previously aware that the stock GNR platform
will scale up to 8 sockets.
So what do these new systems look like?
Here's an 8 socket (2 chassis) HPE system with SNC enabled:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
  3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
  4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
  5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
  6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
  7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
  8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
  9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
 10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
 11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
 12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
 13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
 14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
 15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
10 = Same chassis and socket
12 = Same chassis and socket (SNC)
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
Previously, I thought that REMOTE_DISTANCE represents the
distance between two nodes on different sockets, but that does
not appear to be the case here. I could not find any definition
of “20 or double” in the SLIT section of the ACPI specification.
Thus, I assume this value of 20 is an artificial threshold. In my
view, checking whether all distances above 20 are identical really
depends on the specific platform. For example, if we have a distance
value such as:
22 = Same chassis and non-adjacent socket
then applying the current patch would trigger a warning regardless.
That said, since 20 is an artificial threshold, I have a tentative
idea: we could normalize the SLIT distances by sorting the slit_dist
values, finding the 75th percentile value, keeping all slit_dist values
below the 75th percentile unchanged, and treating all slit_dist values
above the 75th percentile as remote - assigning them the average remote
distance. This way, we could eliminate the arbitrary value of 20. But
that might require a rewrite, so for now it is OK to keep 20.
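To make the percentile idea concrete, here is a rough userspace sketch, not kernel code. It rests on assumptions that are not settled in this thread: the percentile is taken with the nearest-rank method over all N*N SLIT entries (diagonal included), and an entry equal to the threshold counts as remote. The helper names slit_percentile() and slit_normalize() are made up for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_NODES 16

/* SLIT of the 8-socket HPE system quoted in this thread. */
static int slit[NR_NODES][NR_NODES] = {
	{10,12,16,16,16,16,18,18,40,40,40,40,40,40,40,40},
	{12,10,16,16,16,16,18,18,40,40,40,40,40,40,40,40},
	{16,16,10,12,18,18,16,16,40,40,40,40,40,40,40,40},
	{16,16,12,10,18,18,16,16,40,40,40,40,40,40,40,40},
	{16,16,18,18,10,12,16,16,40,40,40,40,40,40,40,40},
	{16,16,18,18,12,10,16,16,40,40,40,40,40,40,40,40},
	{18,18,16,16,16,16,10,12,40,40,40,40,40,40,40,40},
	{18,18,16,16,16,16,12,10,40,40,40,40,40,40,40,40},
	{40,40,40,40,40,40,40,40,10,12,16,16,16,16,18,18},
	{40,40,40,40,40,40,40,40,12,10,16,16,16,16,18,18},
	{40,40,40,40,40,40,40,40,16,16,10,12,18,18,16,16},
	{40,40,40,40,40,40,40,40,16,16,12,10,18,18,16,16},
	{40,40,40,40,40,40,40,40,16,16,18,18,10,12,16,16},
	{40,40,40,40,40,40,40,40,16,16,18,18,12,10,16,16},
	{40,40,40,40,40,40,40,40,18,18,16,16,16,16,10,12},
	{40,40,40,40,40,40,40,40,18,18,16,16,16,16,12,10},
};

static int cmp_int(const void *a, const void *b)
{
	int x = *(const int *)a, y = *(const int *)b;

	return (x > y) - (x < y);
}

/* Nearest-rank percentile (1..100) over every entry of the table. */
static int slit_percentile(int d[NR_NODES][NR_NODES], int pct)
{
	int sorted[NR_NODES * NR_NODES];
	int n = 0, rank;

	assert(pct > 0 && pct <= 100);
	for (int i = 0; i < NR_NODES; i++)
		for (int j = 0; j < NR_NODES; j++)
			sorted[n++] = d[i][j];
	qsort(sorted, n, sizeof(sorted[0]), cmp_int);
	rank = (pct * n + 99) / 100;	/* ceil(pct * n / 100) */
	return sorted[rank - 1];
}

/*
 * Copy d into out, replacing every entry at or above the percentile
 * threshold with the (truncated) average of those entries.
 */
static void slit_normalize(int d[NR_NODES][NR_NODES], int pct,
			   int out[NR_NODES][NR_NODES])
{
	int threshold = slit_percentile(d, pct);
	int total = 0, nr = 0, avg;

	for (int i = 0; i < NR_NODES; i++)
		for (int j = 0; j < NR_NODES; j++)
			if (d[i][j] >= threshold) {
				total += d[i][j];
				nr++;
			}
	avg = total / nr;	/* nr > 0: the threshold is itself an entry */

	for (int i = 0; i < NR_NODES; i++)
		for (int j = 0; j < NR_NODES; j++)
			out[i][j] = d[i][j] < threshold ? d[i][j] : avg;
}
```

With the table above, half of the 256 entries are 40, so the 75th percentile lands on 40 and only the cross-chassis distances are treated as remote; 16 and 18 are left alone. Replacing every 18 with 22 would not move the threshold under these assumptions, so 22 would also stay local.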
40 = Different chassis
Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
the UPI interconnect across the entire system.
We don't experience the scheduler domain issue reported by Tim because our SLIT
provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
because we exceed 2 packages.
This is unnecessary when the average distances to remote packages are
the same.
Support single average remote distance on systems with more than 2
packages, preventing unnecessary warnings and backtraces, by checking if
average distances to remote packages are the same.
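As an illustration of the check described here, the following is a minimal userspace sketch, not the patch itself. It assumes SNC with two nodes per package (nodes 2k and 2k+1 belong to package k), treats any distance at or above 20 as remote, and uses truncating integer averages. node_to_pkg() and remote_avg_is_uniform() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES	16
#define NODES_PER_PKG	2	/* assumption: SNC-2 */
#define NR_PKGS		(NR_NODES / NODES_PER_PKG)
#define REMOTE_DISTANCE	20	/* the threshold discussed in this thread */

/* SLIT of the 8-socket HPE system quoted in this thread. */
static int slit[NR_NODES][NR_NODES] = {
	{10,12,16,16,16,16,18,18,40,40,40,40,40,40,40,40},
	{12,10,16,16,16,16,18,18,40,40,40,40,40,40,40,40},
	{16,16,10,12,18,18,16,16,40,40,40,40,40,40,40,40},
	{16,16,12,10,18,18,16,16,40,40,40,40,40,40,40,40},
	{16,16,18,18,10,12,16,16,40,40,40,40,40,40,40,40},
	{16,16,18,18,12,10,16,16,40,40,40,40,40,40,40,40},
	{18,18,16,16,16,16,10,12,40,40,40,40,40,40,40,40},
	{18,18,16,16,16,16,12,10,40,40,40,40,40,40,40,40},
	{40,40,40,40,40,40,40,40,10,12,16,16,16,16,18,18},
	{40,40,40,40,40,40,40,40,12,10,16,16,16,16,18,18},
	{40,40,40,40,40,40,40,40,16,16,10,12,18,18,16,16},
	{40,40,40,40,40,40,40,40,16,16,12,10,18,18,16,16},
	{40,40,40,40,40,40,40,40,16,16,18,18,10,12,16,16},
	{40,40,40,40,40,40,40,40,16,16,18,18,12,10,16,16},
	{40,40,40,40,40,40,40,40,18,18,16,16,16,16,10,12},
	{40,40,40,40,40,40,40,40,18,18,16,16,16,16,12,10},
};

/* Hypothetical helper: dense package index of a node. */
static int node_to_pkg(int node)
{
	return node / NODES_PER_PKG;
}

/*
 * Return true when every package sees the same average remote distance.
 * *avg receives the average seen by package 0.
 */
static bool remote_avg_is_uniform(int d[NR_NODES][NR_NODES], int *avg)
{
	int total[NR_PKGS] = {0}, nr[NR_PKGS] = {0};

	assert(avg);
	for (int i = 0; i < NR_NODES; i++) {
		for (int j = 0; j < NR_NODES; j++) {
			int pkg = node_to_pkg(i);

			/* Skip same-package and below-threshold distances. */
			if (pkg == node_to_pkg(j) || d[i][j] < REMOTE_DISTANCE)
				continue;
			total[pkg] += d[i][j];
			nr[pkg]++;
		}
	}

	*avg = nr[0] ? total[0] / nr[0] : 0;
	for (int p = 1; p < NR_PKGS; p++) {
		int pkg_avg = nr[p] ? total[p] / nr[p] : 0;

		if (pkg_avg != *avg)
			return false;
	}
	return true;
}
```

For the table above, every package averages 40 to its remote nodes, so remote_avg_is_uniform(slit, &avg) returns true with avg set to 40, which is the case where a warning would be unnecessary.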
[ ... ]
+ pkg = topology_physical_package_id(cpu);
+ pkg_total_distance[pkg] += distance;
+ pkg_nr_remote[pkg]++;
This is broken, physical_package_id is not guaranteed to be dense.
Thank you, I'll fix this.
It seems that topology_logical_package_id() can work here.
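To illustrate why a dense ID matters for array indexing, here is a small userspace sketch. It is not how the kernel derives logical package IDs; struct pkg_map and pkg_logical_id() are hypothetical and only show the idea of mapping sparse IDs to indices 0..nr-1.

```c
#include <assert.h>

#define MAX_PKGS 8

struct pkg_map {
	int phys_id[MAX_PKGS];	/* physical ID stored at each dense index */
	int nr;			/* number of dense indices handed out */
};

/*
 * Return the dense index for a physical package ID, allocating a new
 * index the first time an ID is seen. Returns -1 when the map is full.
 */
static int pkg_logical_id(struct pkg_map *map, int phys)
{
	assert(map);
	for (int i = 0; i < map->nr; i++)
		if (map->phys_id[i] == phys)
			return i;
	if (map->nr == MAX_PKGS)
		return -1;
	map->phys_id[map->nr] = phys;
	return map->nr++;
}
```

If firmware reported physical IDs such as 0, 4, and 32, indexing an array of MAX_PKGS entries by the physical ID would run out of bounds, while the dense indices 0, 1, and 2 stay within it.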
thanks,
Chenyu