Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

From: Vincent Guittot

Date: Mon May 11 2026 - 09:16:59 EST


On Sat, 9 May 2026 at 20:10, Andrea Righi <arighi@xxxxxxxxxx> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active, always prefer fully-idle SMT cores over partially-idle
> ones.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities (NVIDIA Vera Rubin),
> SMT-aware idle selection has been shown to improve throughput by around
> 15-18% over NO_ASYM mainline and by around 60% over ASYM mainline, for
> CPU-bound workloads (NVBLAS) running an amount of tasks equal to the
> amount of SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Christian Loehle <christian.loehle@xxxxxxx>
> Cc: Koba Ko <kobak@xxxxxxxxxx>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Reported-by: Felix Abecassis <fabecassis@xxxxxxxxxx>
> Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx>

I still have comments about the description and naming below, but
overall the patch looks good to me.

Reviewed-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>

> ---
> kernel/sched/fair.c | 119 +++++++++++++++++++++++++++++++++++++++++---
> 1 file changed, 113 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 960a1a9696b98..6f0835c15ee11 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8018,6 +8018,54 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +/*
> + * Idle-capacity scan converts util_fits_cpu() outcomes into preference ranks,
> + * where lower values indicate a better fit - see select_idle_capacity().
> + *
> + * A CPU that both fits the task and sits on a fully-idle SMT core is returned
> + * immediately and is never assigned one of these ranks. On !SMT every CPU is
> + * its own "core", so the early return covers all fits-and-idle cases and the
> + * core-tier ranks below become unreachable.
> + *
> + * Rank Val Tier Meaning
> + * ------------------------------ --- ------ ---------------------------
> + * ASYM_IDLE_CORE_UCLAMP_MISFIT -4 core Idle core; capacity fits
> + * util but uclamp_min misses.
> + * ASYM_IDLE_CORE_COMPLETE_MISFIT -3 core Idle core; capacity does
> + * not fit. Still beats every
> + * thread-tier rank: a busy
> + * sibling cuts effective
> + * capacity more than a
> + * misfit hurts a quiet core.
> + * ASYM_IDLE_THREAD_FITS -2 thread Busy SMT sibling; capacity
> + * fits util + uclamp.
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT -1 thread Busy SMT sibling; capacity
> + * fits but uclamp_min misses
> + * (native util_fits_cpu()
> + * return value).
> + * ASYM_IDLE_COMPLETE_MISFIT 0 thread Busy SMT sibling; capacity
> + * does not fit.
> + *
> + * ASYM_IDLE_CORE_BIAS (-3) is an offset, not a state. On an idle core,
> + * fits += ASYM_IDLE_CORE_BIAS rebases thread-tier ranks into the core tier:
> + *
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT (-1) + BIAS -> CORE_UCLAMP_MISFIT (-4)
> + * ASYM_IDLE_COMPLETE_MISFIT (0) + BIAS -> CORE_COMPLETE_MISFIT (-3)
> + *
> + * ASYM_IDLE_THREAD_FITS (-2) is never rebased because a fully-fitting idle-core
> + * candidate early-returns from select_idle_capacity().
> + */
> +enum asym_fits_state {
> + ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,

ASYM_IDLE_UCLAMP_MISFIT
See my comments on select_idle_capacity() below for why.

> + ASYM_IDLE_CORE_COMPLETE_MISFIT,

ASYM_IDLE_COMPLETE_MISFIT,

> + ASYM_IDLE_THREAD_FITS,
> + ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> + ASYM_IDLE_COMPLETE_MISFIT,

ASYM_IDLE_THREAD_MISFIT,

> +
> + /* util_fits_cpu() bias for idle core */
> + ASYM_IDLE_CORE_BIAS = -3,
> +};
> +
> /*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -8026,8 +8074,14 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + /*
> + * On !SMT systems, has_idle_core is always false and preferred_core
> + * is always true (CPU == core), so the SMT preference logic below
> + * collapses to the plain capacity scan.
> + */
> + bool has_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> - int fits, best_fits = 0;
> + int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
>
> @@ -8039,6 +8093,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !has_idle_core || is_core_idle(cpu);

If sched_smt_active() is true and test_idle_cores(target) is false
(meaning we have SMT but no idle core), then has_idle_core is false
and preferred_core is true. We will return immediately if
util_fits_cpu() returns > 0, and we will use the ASYM_IDLE_CORE_*
values otherwise. So I think that we should remove the "CORE_" from
the naming.

ASYM_IDLE_THREAD_* values are only used when has_idle_core is true,
i.e. when SMT is active and we are promised to find an idle core.
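
To illustrate the case above, here is a minimal user-space sketch of the
rank computation. It is not kernel code: model_rank() and
MODEL_EARLY_RETURN are hypothetical names used only for illustration, and
the three booleans stand in for sched_smt_active(),
test_idle_cores(target) and is_core_idle(cpu). The enum values are copied
from the patch, and fits is assumed to be the -1/0/1 result of
util_fits_cpu().

```c
#include <assert.h>
#include <stdbool.h>

/* Rank values copied from the patch under review. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,
	ASYM_IDLE_THREAD_FITS,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
	ASYM_IDLE_COMPLETE_MISFIT,

	/* util_fits_cpu() bias for idle core */
	ASYM_IDLE_CORE_BIAS = -3,
};

/*
 * Hypothetical sentinel: models the "return cpu" early exit. Real ranks
 * are all <= 0, so a positive value cannot collide with them.
 */
#define MODEL_EARLY_RETURN 1

/*
 * Hypothetical model of the per-CPU rank logic in select_idle_capacity().
 * smt_active      ~ sched_smt_active()
 * idle_cores_hint ~ test_idle_cores(target)
 * core_idle       ~ is_core_idle(cpu)
 * fits            ~ util_fits_cpu() result: 1, 0 or -1
 */
int model_rank(bool smt_active, bool idle_cores_hint, bool core_idle, int fits)
{
	bool has_idle_core = smt_active && idle_cores_hint;
	bool preferred_core = !has_idle_core || core_idle;

	assert(fits >= -1 && fits <= 1);

	/* Fully fitting CPU on a preferred core: returned immediately. */
	if (fits > 0 && preferred_core)
		return MODEL_EARLY_RETURN;

	/* Fits, but the core is not preferred (busy sibling). */
	if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;

	/* Rebase the misfit ranks for a preferred core. */
	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;

	return fits;
}
```

With SMT active but no idle core (idle_cores_hint false), a CPU whose
sibling is busy still ends up with a "CORE" rank, e.g.
model_rank(true, false, false, 0) yields ASYM_IDLE_CORE_COMPLETE_MISFIT,
which is why the "CORE_" prefix looks misleading.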

> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -8046,8 +8101,13 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
>
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> - /* This CPU fits with all requirements */
> - if (fits > 0)
> + /*
> + * Perfect fit: capacity satisfies util + uclamp and the CPU
> + * sits on a fully-idle SMT core (or this is a !SMT system).

Or there is no idle core to find.


> + * Short-circuit the rank-based selection and return
> + * immediately.
> + */
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -8055,9 +8115,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> */
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
> + /*
> + * fits > 0 implies we are not on a preferred core, but the util
> + * fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> + * so the effective range becomes
> + * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> + * ASYM_IDLE_COMPLETE_MISFIT - does not fit
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> + * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> + */
> + else if (fits > 0)
> + fits = ASYM_IDLE_THREAD_FITS;
>
> /*
> - * First, select CPU which fits better (-1 being better than 0).
> + * If we are on a preferred core, translate the range of fits
> + * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> + * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> + * This ensures that an idle core is always given priority over
> + * (partially) busy core.
> + *
> + * A fully fitting idle core would have returned early and hence
> + * fits > 0 for preferred_core need not be dealt with.
> + */
> + if (preferred_core)
> + fits += ASYM_IDLE_CORE_BIAS;
> +
> + /*
> + * First, select CPU which fits better (lower is more preferred).
> * Then, select the one with best capacity at same level.
> */
> if ((fits < best_fits) ||
> @@ -8068,6 +8152,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> }
> }
>
> + /*
> + * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]
> + * range means the chosen CPU is in a fully idle SMT core. Values above
> + * ASYM_IDLE_CORE_COMPLETE_MISFIT mean we never ranked such a CPU best.
> + *
> + * The asym-capacity wakeup path returns from select_idle_sibling()
> + * after this function and never runs select_idle_cpu(), so the usual
> + * select_idle_cpu() tail that clears idle cores must live here when the
> + * idle-core preference did not win.
> + */
> + if (has_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT)
> + set_idle_cores(target, false);
> +
> return best_cpu;
> }
>
> @@ -8076,12 +8173,22 @@ static inline bool asym_fits_cpu(unsigned long util,
> unsigned long util_max,
> int cpu)
> {
> - if (sched_asym_cpucap_active())
> + if (sched_asym_cpucap_active()) {
> /*
> * Return true only if the cpu fully fits the task requirements
> * which include the utilization and the performance hints.
> + *
> + * When SMT is active, also require that the core has no busy
> + * siblings.
> + *
> + * Note: gating on is_core_idle() also makes the early-bailout
> + * candidates in select_idle_sibling() (target, prev,
> + * recent_used_cpu) idle-core-aware on ASYM+SMT, which the
> + * NO_ASYM path does not do.
> */
> - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + return (!sched_smt_active() || is_core_idle(cpu)) &&
> + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + }
>
> return true;
> }
> --
> 2.54.0
>