[PATCH v4 1/9] sched/topology: Compute sd_weight considering cpuset partitions
From: K Prateek Nayak
Date: Thu Mar 12 2026 - 00:45:20 EST
The "sd_weight" used to calculate the load balancing interval and its
limits is derived from the span weight of the entire topology level,
without accounting for cpuset partitions.
For example, consider a large system of 128 CPUs divided into 8
partitions of 16 CPUs each, which is typical when deploying virtual
machines:
    [               PKG Domain: 128 CPUs                ]
    [Partition0: 16 CPUs][Partition1: 16 CPUs] ... [Partition7: 16 CPUs]
Although each partition contains only 16 CPUs, the load balancing
interval is set to a minimum of 128 jiffies based on the span of the
entire 128-CPU domain. This can leave imbalances within a partition
standing for longer, even though balancing across only 16 CPUs is
cheaper.
Instead, compute "sd_weight" after "sd_span" has been intersected with
the cpu_map covered by the partition, and set the load balancing
interval and its limits accordingly.
For the above example, the balancing intervals for a partition's PKG
domain change as follows:
                       before    after
    balance_interval      128       16
    min_interval          128       16
    max_interval          256       32
Intervals are now proportional to the CPUs in the partitioned domain as
was intended by the original formula.
Fixes: cb83b629bae03 ("sched/numa: Rewrite the CONFIG_NUMA sched domain support")
Reviewed-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>
Reviewed-by: Chen Yu <yu.c.chen@xxxxxxxxx>
Reviewed-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Signed-off-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
---
Changelog v3..v4:
o Illustrated the changes in the load balancing intervals with an
example. (Shrikanth)
o Collected the tags from Chen Yu, Shrikanth, and Valentin. (Thanks a
  ton!)
---
kernel/sched/topology.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 061f8c85f555..34b20b0e1867 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1645,8 +1645,6 @@ sd_init(struct sched_domain_topology_level *tl,
struct cpumask *sd_span;
u64 now = sched_clock();
- sd_weight = cpumask_weight(tl->mask(tl, cpu));
-
if (tl->sd_flags)
sd_flags = (*tl->sd_flags)();
if (WARN_ONCE(sd_flags & ~TOPOLOGY_SD_FLAGS,
@@ -1654,8 +1652,6 @@ sd_init(struct sched_domain_topology_level *tl,
sd_flags &= TOPOLOGY_SD_FLAGS;
*sd = (struct sched_domain){
- .min_interval = sd_weight,
- .max_interval = 2*sd_weight,
.busy_factor = 16,
.imbalance_pct = 117,
@@ -1675,7 +1671,6 @@ sd_init(struct sched_domain_topology_level *tl,
,
.last_balance = jiffies,
- .balance_interval = sd_weight,
/* 50% success rate */
.newidle_call = 512,
@@ -1693,6 +1688,11 @@ sd_init(struct sched_domain_topology_level *tl,
cpumask_and(sd_span, cpu_map, tl->mask(tl, cpu));
sd_id = cpumask_first(sd_span);
+ sd_weight = cpumask_weight(sd_span);
+ sd->min_interval = sd_weight;
+ sd->max_interval = 2 * sd_weight;
+ sd->balance_interval = sd_weight;
+
sd->flags |= asym_cpu_capacity_classify(sd_span, cpu_map);
WARN_ONCE((sd->flags & (SD_SHARE_CPUCAPACITY | SD_ASYM_CPUCAPACITY)) ==
--
2.34.1