Re: [PATCH v2] sched/topology: Check average distances to remote packages
From: Peter Zijlstra
Date: Mon Feb 23 2026 - 12:07:22 EST
On Wed, Feb 04, 2026 at 06:24:26PM -0600, Kyle Meyer wrote:
> Granite Rapids (GNR) and Clearwater Forest (CWF) average distances to
> remote packages to fix scheduler domains, see [1] for more information.
>
> A warning and backtrace are printed when sub-NUMA clustering (SNC) is
> enabled and there are more than 2 packages because the average distances
> to remote packages could be different, skewing the single average remote
> distance.
But earlier Tim said these systems will not have more than 2 packages.
So which is it? What do these new systems look like?
> This is unnecessary when the average distances to remote packages are
> the same.
>
> Support single average remote distance on systems with more than 2
> packages, preventing unnecessary warnings and backtraces, by checking if
> average distances to remote packages are the same.
> ---
> arch/x86/kernel/smpboot.c | 69 ++++++++++++++++++++++++++++-----------
> 1 file changed, 50 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
> index 5cd6950ab672..dc8f15bd2e19 100644
> --- a/arch/x86/kernel/smpboot.c
> +++ b/arch/x86/kernel/smpboot.c
> @@ -518,27 +518,69 @@ static int avg_remote_numa_distance(void)
>  {
>  	int i, j;
>  	int distance, nr_remote, total_distance;
> +	int max_pkgs = topology_max_packages();
> +	int cpu, pkg, pkg_avg_distance;
> +	int *pkg_total_distance = NULL, *pkg_nr_remote = NULL;
Can you put these declarations in the usual reverse xmas tree order?
>  	if (sched_avg_remote_distance > 0)
>  		return sched_avg_remote_distance;
>
> +	sched_avg_remote_distance = REMOTE_DISTANCE;
> +
>  	nr_remote = 0;
>  	total_distance = 0;
> +
> +	pkg_total_distance = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> +	if (!pkg_total_distance)
> +		goto cleanup;
> +
> +	pkg_nr_remote = kcalloc(max_pkgs, sizeof(int), GFP_KERNEL);
> +	if (!pkg_nr_remote)
> +		goto cleanup;
> +
>  	for_each_node_state(i, N_CPU) {
>  		for_each_node_state(j, N_CPU) {
>  			distance = node_distance(i, j);
>
> -			if (distance >= REMOTE_DISTANCE) {
> -				nr_remote++;
> -				total_distance += distance;
> -			}
> +			if (distance < REMOTE_DISTANCE)
> +				continue;
> +
> +			nr_remote++;
> +			total_distance += distance;
> +
> +			cpu = cpumask_first(cpumask_of_node(j));
> +			if (cpu >= nr_cpu_ids)
> +				continue;
> +
> +			pkg = topology_physical_package_id(cpu);
> +			pkg_total_distance[pkg] += distance;
> +			pkg_nr_remote[pkg]++;
This is broken; topology_physical_package_id() is not guaranteed to be
dense, so pkg can index past the end of these max_pkgs-sized arrays.
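To illustrate the density problem, here is a minimal userspace sketch (not
a proposed kernel fix) of translating sparse package ids into dense array
slots before indexing. MAX_PKGS, struct pkg_map and pkg_slot() are made-up
names used only for this illustration.

```c
#include <assert.h>

#define MAX_PKGS 4	/* hypothetical stand-in for topology_max_packages() */

/* Hypothetical helper state: dense slot -> physical package id. */
struct pkg_map {
	int id[MAX_PKGS];
	int nr;
};

/*
 * Return the dense slot for @phys_id, allocating the next free slot the
 * first time an id is seen. Returns -1 once all slots are in use.
 */
static int pkg_slot(struct pkg_map *map, int phys_id)
{
	int i;

	assert(map->nr >= 0 && map->nr <= MAX_PKGS);

	for (i = 0; i < map->nr; i++) {
		if (map->id[i] == phys_id)
			return i;
	}

	if (map->nr == MAX_PKGS)
		return -1;

	map->id[map->nr] = phys_id;
	return map->nr++;
}
```

With two packages that firmware happens to number 0 and 8, pkg_slot()
hands back slots 0 and 1, so a MAX_PKGS-sized array is indexed safely,
whereas indexing by the raw id 8 would run off the end.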
>  		}
>  	}
> -	if (nr_remote)
> -		sched_avg_remote_distance = total_distance / nr_remote;
> -	else
> -		sched_avg_remote_distance = REMOTE_DISTANCE;
>
> +	if (!nr_remote)
> +		goto cleanup;
> +
> +	sched_avg_remote_distance = total_distance / nr_remote;
> +
> +	/*
> +	 * Single average remote distance won't be appropriate if different
> +	 * packages have different distances to remote packages.
> +	 */
> +	for (i = 0; i < max_pkgs; i++) {
> +		if (!pkg_nr_remote[i])
> +			continue;
> +
> +		pkg_avg_distance = pkg_total_distance[i] / pkg_nr_remote[i];
> +
> +		pr_debug("sched: Avg. distance to remote package %d: %d\n", i, pkg_avg_distance);
> +
> +		if (pkg_avg_distance != sched_avg_remote_distance)
> +			WARN_ONCE(1, "sched: Avg. distances to remote packages are different\n");
> +	}
This is pretty yuck.
Also, what's with the pr_debug() stuff?
Anyway, that function was fairly magical, and now it is nearly
impenetrable. If we want this, it needs comments. Definitely more
comments, with nice pictures in them.
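For example, a toy userspace version of the check, with made-up distances,
shows what such a comment would need to explain. This is only a sketch
under assumptions: single_avg_ok(), MAX_PKGS, the flattened distance table
and the dense node-to-package map are hypothetical, and REMOTE_DISTANCE
uses the generic kernel default of 20.

```c
#include <assert.h>
#include <stdbool.h>

#define REMOTE_DISTANCE	20	/* same value as the generic kernel default */
#define MAX_PKGS	8	/* hypothetical upper bound for this sketch */

/*
 * @dist is a flattened nr_nodes x nr_nodes distance table and
 * @node_pkg maps each node to a dense package index.
 *
 * Stores the overall average remote distance in @avg and returns true
 * when every package's own average matches it.
 */
static bool single_avg_ok(const int *dist, const int *node_pkg,
			  int nr_nodes, int *avg)
{
	int pkg_total[MAX_PKGS] = { 0 }, pkg_nr[MAX_PKGS] = { 0 };
	int total = 0, nr = 0;
	int i, j, p;

	for (i = 0; i < nr_nodes; i++) {
		for (j = 0; j < nr_nodes; j++) {
			int d = dist[i * nr_nodes + j];

			/* Local and intra-package distances are ignored. */
			if (d < REMOTE_DISTANCE)
				continue;

			p = node_pkg[j];
			assert(p >= 0 && p < MAX_PKGS);

			total += d;
			nr++;
			pkg_total[p] += d;
			pkg_nr[p]++;
		}
	}

	*avg = nr ? total / nr : REMOTE_DISTANCE;

	for (p = 0; p < MAX_PKGS; p++) {
		if (pkg_nr[p] && pkg_total[p] / pkg_nr[p] != *avg)
			return false;
	}

	return true;
}
```

Three single-node packages that are all 21 apart average to 21 everywhere
and pass. Put package 2 at distance 31 from the other two and the overall
average becomes 27 while package 0 averages 26, so the check fails.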
> +cleanup:
> +	kfree(pkg_nr_remote);
> +	kfree(pkg_total_distance);
>  	return sched_avg_remote_distance;
>  }
>
> @@ -564,18 +606,7 @@ int arch_sched_node_distance(int from, int to)
>  		 * in the remote package in the same sched group.
>  		 * Simplify NUMA domains and avoid extra NUMA levels including
>  		 * different remote NUMA nodes and local nodes.
> -		 *
> -		 * GNR and CWF don't expect systems with more than 2 packages
> -		 * and more than 2 hops between packages. Single average remote
> -		 * distance won't be appropriate if there are more than 2
> -		 * packages as average distance to different remote packages
> -		 * could be different.
>  		 */
> -		WARN_ONCE(topology_max_packages() > 2,
> -			  "sched: Expect only up to 2 packages for GNR or CWF, "
> -			  "but saw %d packages when building sched domains.",
> -			  topology_max_packages());
> -
>  		d = avg_remote_numa_distance();
>  	}
>  	return d;
> --
> 2.52.0
>