[PATCH] sched/fair: Stabilize idle SMT core selection with asym-capacity
From: Andrea Righi
Date: Tue Jun 30 2026 - 11:29:02 EST

select_idle_capacity() scans all logical CPUs even when it is looking
for a fully idle SMT core. Two concurrent wakeups can therefore observe
the same core as idle, encounter different siblings first, and place one
task on each sibling while another core remains unused.

Make every logical CPU of a selected idle core resolve to the same
stable CPU representative within the scan's existing affinity and
scheduling-domain mask. If the first task is enqueued before the next
scan examines the core, that scan rejects the now-busy core. If both
scans observe the core as idle, they select the same runqueue even if
the first enqueue becomes visible before the second scan finishes,
exposing the imbalance to the load balancer.

The symmetric-capacity idle CPU selection path is subject to the same
race, but normally returns as soon as select_idle_core() finds a fully
idle core, reducing the conflict window. The per-CPU capacity scan can
retain an idle-core candidate while evaluating other CPUs, giving
concurrent wakeups more opportunity to select different siblings of the
same SMT core. Therefore, limit the normalization to the asym-capacity
path, where this behavior has a measurable impact.

On NVIDIA Vera Rubin (arm64, 176 CPUs/88 cores per NUMA node), a
CPU-intensive NVPL SGEMM workload restricted to 88 threads (one per
core) showed a consistent 23% increase in mean throughput across
multiple runs.

For comparison, DCPerf MediaWiki running at system saturation (with all
SMT siblings busy) showed neither a benefit nor a regression: throughput
and Nginx request latency remained within measurement error.

Likewise, schbench under partially idle conditions showed no material
change in wakeup latency, request latency, or throughput (within 0.1%).
Tail wakeup latency was more consistent across runs with this change
applied.

Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx>
---
kernel/sched/fair.c | 19 +++++++++++++++++--
1 file changed, 17 insertions(+), 2 deletions(-)
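
The normalization described in the changelog can be illustrated with a
small user-space model. This is only a sketch under stated assumptions:
a 64-bit word stands in for struct cpumask, and MODEL_NR_CPUS,
model_first_cpu() and model_core_representative() are hypothetical names
that do not exist in the kernel. It mirrors the idea of taking the first
CPU in the intersection of the core's sibling mask and the scan mask,
falling back to the original CPU when that intersection is empty.

```c
#include <assert.h>
#include <stdint.h>

/*
 * User-space model only: a 64-bit word stands in for struct cpumask.
 * All names below are hypothetical and are not kernel symbols.
 */
#define MODEL_NR_CPUS 64

/* Return the lowest CPU set in @mask, or MODEL_NR_CPUS if @mask is empty. */
static int model_first_cpu(uint64_t mask)
{
	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return MODEL_NR_CPUS;
}

/*
 * Map @cpu to the lowest-numbered sibling of its core that is also set in
 * @allowed. Fall back to @cpu itself if no sibling is in @allowed.
 */
static int model_core_representative(int cpu, uint64_t smt_siblings,
				     uint64_t allowed)
{
	int first;

	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	first = model_first_cpu(smt_siblings & allowed);
	return first < MODEL_NR_CPUS ? first : cpu;
}
```

With a core whose siblings are CPUs 4 and 5 (mask 0x30) and a scan mask
that allows both, the model maps either sibling to CPU 4, so two scans
that encounter different siblings first still agree on one runqueue. If
the scan mask allows only CPU 5, the model returns CPU 5.
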
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d78467ec6ee13..f846fbe7379f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8647,6 +8647,16 @@ enum asym_fits_state {
ASYM_IDLE_CORE_BIAS = -3,
};
+/*
+ * Return a stable CPU representative of @cpu's SMT core within @cpus.
+ */
+static int select_idle_core_cpu(int cpu, const struct cpumask *cpus)
+{
+ int sibling = cpumask_first_and(cpu_smt_mask(cpu), cpus);
+
+ return sibling < nr_cpu_ids ? sibling : cpu;
+}
+
/*
* Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
* the task fits. If no CPU is big enough, but there are idle ones, try to
@@ -8661,6 +8671,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
* collapses to the plain capacity scan.
*/
bool has_idle_core = sched_smt_active() && test_idle_cores(target);
+ bool best_idle_core = false;
unsigned long task_util, util_min, util_max, best_cap = 0;
int fits, best_fits = ASYM_IDLE_THREAD_MISFIT;
int cpu, best_cpu = -1;
@@ -8686,7 +8697,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
}
for_each_cpu_wrap(cpu, cpus, target) {
- bool preferred_core = !has_idle_core || is_core_idle(cpu);
+ bool idle_core = !sched_smt_active() || is_core_idle(cpu);
+ bool preferred_core = !has_idle_core || idle_core;
unsigned long cpu_cap = capacity_of(cpu);
/*
@@ -8709,7 +8721,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
* immediately.
*/
if (fits > 0 && preferred_core)
- return cpu;
+ return idle_core ? select_idle_core_cpu(cpu, cpus) : cpu;
/*
* Only the min performance hint (i.e. uclamp_min) doesn't fit.
* Look for the CPU with best capacity.
@@ -8750,6 +8762,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
best_cap = cpu_cap;
best_cpu = cpu;
best_fits = fits;
+ best_idle_core = idle_core;
}
}
@@ -8765,6 +8778,8 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
*/
if (has_idle_core && best_fits > ASYM_IDLE_COMPLETE_MISFIT)
set_idle_cores(target, false);
+ else if (best_idle_core)
+ best_cpu = select_idle_core_cpu(best_cpu, cpus);
return best_cpu;
}
--
2.54.0