Re: [PATCH v2] sched/topology: Check average distances to remote packages
From: Kyle Meyer
Date: Tue Feb 24 2026 - 21:09:54 EST
On Mon, Feb 23, 2026 at 06:03:14PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
> > Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
> > remote packages to fix scheduler domains, see [1] for more information.
> >
> > A warning and backtrace are printed when sub-NUMA clustering (SNC) is
> > enabled and there are more than 2 packages because the average distances
> > to remote packages could be different, skewing the single average remote
> > distance.
>
> But earlier Tim said these systems will not have more than 2 packages.
> So what's what?
We have Intel customer reference boards with 2, 4, and 8 sockets.
> So what do these new systems look like?
Here's an 8 socket (2 chassis) HPE system with SNC enabled:
node   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
  0:  10  12  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  1:  12  10  16  16  16  16  18  18  40  40  40  40  40  40  40  40
  2:  16  16  10  12  18  18  16  16  40  40  40  40  40  40  40  40
  3:  16  16  12  10  18  18  16  16  40  40  40  40  40  40  40  40
  4:  16  16  18  18  10  12  16  16  40  40  40  40  40  40  40  40
  5:  16  16  18  18  12  10  16  16  40  40  40  40  40  40  40  40
  6:  18  18  16  16  16  16  10  12  40  40  40  40  40  40  40  40
  7:  18  18  16  16  16  16  12  10  40  40  40  40  40  40  40  40
  8:  40  40  40  40  40  40  40  40  10  12  16  16  16  16  18  18
  9:  40  40  40  40  40  40  40  40  12  10  16  16  16  16  18  18
 10:  40  40  40  40  40  40  40  40  16  16  10  12  18  18  16  16
 11:  40  40  40  40  40  40  40  40  16  16  12  10  18  18  16  16
 12:  40  40  40  40  40  40  40  40  16  16  18  18  10  12  16  16
 13:  40  40  40  40  40  40  40  40  16  16  18  18  12  10  16  16
 14:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  10  12
 15:  40  40  40  40  40  40  40  40  18  18  16  16  16  16  12  10
10 = Same chassis, socket, and NUMA node (local)
12 = Same chassis and socket, other SNC node
16 = Same chassis and adjacent socket
18 = Same chassis and non-adjacent socket
40 = Different chassis
Each processor connects to an ASIC (XNC) that acts as a multiplexer, extending
the UPI interconnect across the entire system.
We don't experience the scheduler domain issue reported by Tim because our SLIT
provides symmetric distances to remote NUMA nodes, but we trigger the WARN_ONCE
because we exceed 2 packages.
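
To illustrate why the check is unnecessary for this SLIT, here is a minimal
userspace sketch (not kernel code) that recomputes the single average remote
distance and the per-package averages from the table above. Assumptions:
REMOTE_DISTANCE is 20 as in include/linux/topology.h, nodes 2k and 2k+1 belong
to package k (inferred from the SNC entries in the legend), and the helper
names slit_distance(), avg_remote_distance(), and pkg_averages_match() are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES        16
#define NODES_PER_PKG   2   /* assumption: SNC2, nodes 2k and 2k+1 share package k */
#define NR_PKGS         (NR_NODES / NODES_PER_PKG)
#define REMOTE_DISTANCE 20  /* same value as include/linux/topology.h */

/* Distances within one chassis (8 nodes), copied from the table above. */
static const int chassis[8][8] = {
	{ 10, 12, 16, 16, 16, 16, 18, 18 },
	{ 12, 10, 16, 16, 16, 16, 18, 18 },
	{ 16, 16, 10, 12, 18, 18, 16, 16 },
	{ 16, 16, 12, 10, 18, 18, 16, 16 },
	{ 16, 16, 18, 18, 10, 12, 16, 16 },
	{ 16, 16, 18, 18, 12, 10, 16, 16 },
	{ 18, 18, 16, 16, 16, 16, 10, 12 },
	{ 18, 18, 16, 16, 16, 16, 12, 10 },
};

/* Hypothetical helper: nodes in different chassis are at distance 40. */
int slit_distance(int i, int j)
{
	assert(i >= 0 && i < NR_NODES && j >= 0 && j < NR_NODES);
	return (i / 8 == j / 8) ? chassis[i % 8][j % 8] : 40;
}

/* Average over every node pair whose distance is >= REMOTE_DISTANCE. */
int avg_remote_distance(void)
{
	int i, j, total = 0, nr = 0;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = slit_distance(i, j);

			if (d < REMOTE_DISTANCE)
				continue;
			total += d;
			nr++;
		}
	}
	return nr ? total / nr : REMOTE_DISTANCE;
}

/* True if each package's average remote distance equals the global one. */
bool pkg_averages_match(void)
{
	int total[NR_PKGS] = { 0 }, nr[NR_PKGS] = { 0 };
	int avg = avg_remote_distance();
	int i, j, p;

	for (i = 0; i < NR_NODES; i++) {
		for (j = 0; j < NR_NODES; j++) {
			int d = slit_distance(i, j);

			if (d < REMOTE_DISTANCE)
				continue;
			total[j / NODES_PER_PKG] += d;
			nr[j / NODES_PER_PKG]++;
		}
	}
	for (p = 0; p < NR_PKGS; p++)
		if (nr[p] && total[p] / nr[p] != avg)
			return false;
	return true;
}
```

The only entries at or above REMOTE_DISTANCE are the cross-chassis 40s, so
avg_remote_distance() returns 40 and pkg_averages_match() returns true.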
> > This is unnecessary when the average distances to remote packages are
> > the same.
> >
> > Support single average remote distance on systems with more than 2
> > packages, preventing unnecessary warnings and backtraces, by checking if
> > average distances to remote packages are the same.
>
> > ---
> > arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
> > 1 file changed, 50 insertions(+), 19 deletions(-)
> >
> > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> > index 5cd6950ab672..dc8f15bd2e19 100644
> > --- a/arch/x86/kernel/smpboot.c
> > +++ b/arch/x86/kernel/smpboot.c
> > @@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
> >  {
> >  	int i, j;
> >  	int distance, nr_remote, total_distance;
> > +	int max_pkgs = topology_max_packages();
> > +	int cpu, pkg, pkg_avg_distance;
> > +	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;
>
> Can you make that the normal reverse xmas thing?
Yes.
> >  	if (sched_avg_remote_distance > 0)
> >  		return sched_avg_remote_distance;
> >
> > +	sched_avg_remote_distance = REMOTE_DISTANCE;
> > +
> >  	nr_remote = 0;
> >  	total_distance = 0;
> > +
> > +	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> > +	if (!pkg_total_distance)
> > +		goto cleanup;
> > +
> > +	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> > +	if (!pkg_nr_remote)
> > +		goto cleanup;
> > +
> >  	for_each_node_state(i, N_CPU) {
> >  		for_each_node_state(j, N_CPU) {
> >  			distance = node_distance(i, j);
> >
> > -			if (distance >= REMOTE_DISTANCE) {
> > -				nr_remote++;
> > -				total_distance += distance;
> > -			}
> > +			if (distance < REMOTE_DISTANCE)
> > +				continue;
> > +
> > +			nr_remote++;
> > +			total_distance += distance;
> > +
> > +			cpu = cpumask_first(cpumask_of_node(j));
> > +			if (cpu >= nr_cpu_ids)
> > +				continue;
> > +
> > +			pkg = topology_physical_package_id(cpu);
> > +			pkg_total_distance[pkg] += distance;
> > +			pkg_nr_remote[pkg]++;
>
> This is broken, physical_package_id is not guaranteed to be dense.
Thank you, I'll fix this.
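
For illustration only, and not the planned v3 change: one way to avoid indexing
an array directly by a sparse ID is to translate each physical package ID into
a dense index on first sight. The userspace sketch below uses a hypothetical
struct pkg_map and pkg_dense_index() helper, and MAX_PKGS is an assumed bound
chosen for the example.

```c
#include <assert.h>

#define MAX_PKGS 8	/* assumed upper bound, for the example only */

/* Hypothetical lookup table: ids[k] is the physical ID given dense index k. */
struct pkg_map {
	int ids[MAX_PKGS];
	int nr;
};

/*
 * Return a dense index in [0, map->nr) for phys_id, assigning the next free
 * index the first time an ID is seen. Returns -1 if the table is full.
 */
int pkg_dense_index(struct pkg_map *map, int phys_id)
{
	int i;

	assert(map->nr >= 0 && map->nr <= MAX_PKGS);

	for (i = 0; i < map->nr; i++)
		if (map->ids[i] == phys_id)
			return i;

	if (map->nr == MAX_PKGS)
		return -1;

	map->ids[map->nr] = phys_id;
	return map->nr++;
}
```

With this mapping, sparse IDs such as 0, 4, and 12 would land in slots 0, 1,
and 2, so per-package accumulators sized by the package count stay in bounds.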
> >  		}
> >  	}
> > -	if (nr_remote)
> > -		sched_avg_remote_distance = total_distance / nr_remote;
> > -	else
> > -		sched_avg_remote_distance = REMOTE_DISTANCE;
> >
> > +	if (!nr_remote)
> > +		goto cleanup;
> > +
> > +	sched_avg_remote_distance = total_distance / nr_remote;
> > +
> > +	/*
> > +	 * Single average remote distance won't be appropriate if different
> > +	 * packages have different distances to remote packages.
> > +	 */
> > +	for (i = 0; i < max_pkgs; i++) {
> > +		if (!pkg_nr_remote[i])
> > +			continue;
> > +
> > +		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
> > +
> > +		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
> > +
> > +		if (pkg_avg_distance != sched_avg_remote_distance)
> > +			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
> > +	}
>
> This is pretty yuck.
>
> Also, what's with the pr_debug() stuff?
>
> Anyway, that function was fairly magical, and now it is nearly
> impenetrable. If we want this, it needs comments. Definitely more
> comments, with nice pictures on.
OK, thank you for the feedback, I'll work on a v3.
Thanks,
Kyle Meyer