Re: [PATCH v4 1/2] sched: Create architecture specific sched domain distances

From: Chen, Yu C

Date: Sat Sep 27 2025 - 08:34:38 EST


On 9/20/2025 1:50 AM, Tim Chen wrote:
Allow architecture specific sched domain NUMA distances that are
modified from actual NUMA node distances for the purpose of building
NUMA sched domains.

Keep actual NUMA distances separately if modified distances
are used for building sched domains. Such distances
are still needed as NUMA balancing benefits from finding the
NUMA nodes that are actually closer to a task numa_group.

Consolidate the recording of unique NUMA distances in an array to
sched_record_numa_dist() so the function can be reused to record NUMA
distances when the NUMA distance metric is changed.

No functional change, and no additional distance array is
allocated, if no arch specific NUMA distances are defined.

Co-developed-by: Vinicius Costa Gomes <vinicius.gomes@xxxxxxxxx>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@xxxxxxxxx>
Signed-off-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
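
As a reading aid for the commit message above, here is a minimal user-space sketch of recording the unique distances once per distance metric. It is not the kernel's sched_record_numa_dist(): every demo_* name, the 4-node table and the "clamp remote distances" arch view are made up for illustration, and the 256-value bound is only an assumption loosely modelled on NR_DISTANCE_VALUES.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NR_NODES		4
#define DEMO_MAX_DISTANCE	256	/* assumed upper bound on a distance value */

/* Hypothetical callback type: the actual metric or an arch-modified one. */
typedef int (*demo_dist_fn)(int from, int to);

/*
 * Write the unique values returned by dist() over all node pairs to out[],
 * in ascending order, and return how many were written (the "levels").
 */
static int demo_record_numa_dist(demo_dist_fn dist, int *out, size_t out_len)
{
	bool seen[DEMO_MAX_DISTANCE] = { false };
	int levels = 0;

	for (int i = 0; i < DEMO_NR_NODES; i++) {
		for (int j = 0; j < DEMO_NR_NODES; j++) {
			int d = dist(i, j);

			assert(d >= 0 && d < DEMO_MAX_DISTANCE);
			seen[d] = true;
		}
	}

	for (int d = 0; d < DEMO_MAX_DISTANCE; d++) {
		if (seen[d]) {
			assert((size_t)levels < out_len);
			out[levels++] = d;
		}
	}
	return levels;
}

/* Made-up table: two pairs of close nodes and two remote distances. */
static const int demo_table[DEMO_NR_NODES][DEMO_NR_NODES] = {
	{ 10, 12, 21, 22 },
	{ 12, 10, 22, 21 },
	{ 21, 22, 10, 12 },
	{ 22, 21, 12, 10 },
};

/* Stand-in for the actual node distance. */
static int demo_actual_distance(int from, int to)
{
	return demo_table[from][to];
}

/* Made-up arch view: fold every remote distance above 20 into 21. */
static int demo_arch_distance(int from, int to)
{
	int d = demo_table[from][to];

	return d > 20 ? 21 : d;
}
```

Calling demo_record_numa_dist() once with each callback yields two arrays: the arch view has fewer unique values (fewer sched domain levels), while the actual view keeps every distance for lookups such as find_numa_distance().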

[snip]

@@ -1591,10 +1591,12 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
enum numa_topology_type sched_numa_topology_type;
static int sched_domains_numa_levels;
+static int sched_numa_node_levels;

I agree that the benefit of maintaining two NUMA distances - one for the
sched_domain and another for the NUMA balancing/page allocation policy - is
to avoid complicating the sched_domain hierarchy while preserving the
advantages of NUMA locality.

Meanwhile, I wonder if we could also add an "orig" prefix to the names
for the original NUMA distance, so that their meaning is clear at a
glance later. For example,
sched_orig_node_levels
sched_orig_node_distance

static int sched_domains_curr_level;
int sched_max_numa_distance;
static int *sched_domains_numa_distance;
+static int *sched_numa_node_distance;
static struct cpumask ***sched_domains_numa_masks;
#endif /* CONFIG_NUMA */
@@ -1808,10 +1810,10 @@ bool find_numa_distance(int distance)
 		return true;
 	rcu_read_lock();
-	distances = rcu_dereference(sched_domains_numa_distance);
+	distances = rcu_dereference(sched_numa_node_distance);
 	if (!distances)
 		goto unlock;
-	for (i = 0; i < sched_domains_numa_levels; i++) {
+	for (i = 0; i < sched_numa_node_levels; i++) {
 		if (distances[i] == distance) {
 			found = true;
 			break;
@@ -1887,14 +1889,48 @@ static void init_numa_topology_type(int offline_node)
#define NR_DISTANCE_VALUES (1 << DISTANCE_BITS)
-void sched_init_numa(int offline_node)
+/*
+ * An architecture could modify its NUMA distance, to change
+ * grouping of NUMA nodes and number of NUMA levels when creating
+ * NUMA level sched domains.
+ *
+ * A NUMA level is created for each unique
+ * arch_sched_node_distance.
+ */
+static bool __modified_sched_node_dist = true;
+
+int __weak arch_sched_node_distance(int from, int to)
 {
-	struct sched_domain_topology_level *tl;
-	unsigned long *distance_map;
+	if (__modified_sched_node_dist)
+		__modified_sched_node_dist = false;
+
+	return node_distance(from, to);
+}
+
+static bool modified_sched_node_distance(void)
+{
+	/*
+	 * Call arch_sched_node_distance()
+	 * to determine if arch_sched_node_distance
+	 * has been modified from node_distance()
+	 * to arch specific distance.
+	 */
+	arch_sched_node_distance(0, 0);
+	return __modified_sched_node_dist;
+}
+
+

If our goal is to figure out whether arch_sched_node_distance()
has been overridden, how about the following alias?

int __weak arch_sched_node_distance(int from, int to)
{
	return __node_distance(from, to);
}
int arch_sched_node_distance_original(int from, int to) __weak __alias(arch_sched_node_distance);

static bool arch_sched_node_distance_is_overridden(void)
{
	return arch_sched_node_distance != arch_sched_node_distance_original;
}

so arch_sched_node_distance_is_overridden() can replace modified_sched_node_distance()
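
For anyone who wants to try the idea outside the kernel, a user-space sketch could look like the following. It assumes GCC or Clang on an ELF target, spells out the attributes that __weak and __alias() are assumed to expand to, and uses a made-up demo_node_distance() in place of __node_distance(). Whether the compiler always leaves the pointer comparison to be resolved at link time is something I have not verified for every toolchain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for __node_distance(): 10 locally, 20 remotely. */
static int demo_node_distance(int from, int to)
{
	return from == to ? 10 : 20;
}

/* Weak default; a strong definition in another object file replaces it. */
__attribute__((weak)) int arch_sched_node_distance(int from, int to)
{
	return demo_node_distance(from, to);
}

/*
 * Second name bound to the default body in this translation unit. It keeps
 * pointing at that body even if arch_sched_node_distance is overridden.
 */
int arch_sched_node_distance_original(int from, int to)
	__attribute__((weak, alias("arch_sched_node_distance")));

/* True only when the linker picked a different arch_sched_node_distance. */
static bool arch_sched_node_distance_is_overridden(void)
{
	return arch_sched_node_distance != arch_sched_node_distance_original;
}

/* Sanity check: the alias always reaches the default implementation. */
void demo_check_default(void)
{
	assert(arch_sched_node_distance_original(0, 1) == demo_node_distance(0, 1));
}
```

Built as a single file there is no override, so arch_sched_node_distance_is_overridden() returns false. Linking a second object that defines a non-weak arch_sched_node_distance() is expected to make it return true. Clang may warn that the alias always resolves to the local weak definition, which is the behaviour this check relies on.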

thanks,
Chenyu