Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection

From: Vincent Guittot

Date: Wed May 06 2026 - 06:29:36 EST


On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@xxxxxxxxxx> wrote:
>
> On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> different per-core frequencies), the wakeup path uses
> select_idle_capacity() and prioritizes idle CPUs with higher capacity
> for better task placement. However, when those CPUs belong to SMT cores,
> their effective capacity can be much lower than the nominal capacity
> when the sibling thread is busy: SMT siblings compete for shared
> resources, so a "high capacity" CPU that is idle but whose sibling is
> busy does not deliver its full capacity. This effective capacity
> reduction cannot be modeled by the static capacity value alone.
>
> Introduce SMT awareness in the asym-capacity idle selection policy: when
> SMT is active, always prefer fully-idle SMT cores over partially-idle
> ones.
>
> Prioritizing fully-idle SMT cores yields better task placement because
> the effective capacity of partially-idle SMT cores is reduced; always
> preferring them when available leads to more accurate capacity usage on
> task wakeup.
>
> On an SMT system with asymmetric CPU capacities, SMT-aware idle
> selection has been shown to improve throughput by around 15-18% for
> CPU-bound workloads, running an amount of tasks equal to the amount of
> SMT cores.
>
> Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> Cc: Christian Loehle <christian.loehle@xxxxxxx>
> Cc: Koba Ko <kobak@xxxxxxxxxx>
> Reviewed-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> Reported-by: Felix Abecassis <fabecassis@xxxxxxxxxx>
> Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 65 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index bbdf537f61154..6a7e4943804b5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> return idle_cpu;
> }
>
> +/*
> + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> + * are more preferred (see select_idle_capacity()).
> + */
> +enum asym_fits_state {
> + /* In descending order of preference */
> + ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> + ASYM_IDLE_CORE_COMPLETE_MISFIT,
> + ASYM_IDLE_THREAD_FITS,
> + ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> + ASYM_IDLE_COMPLETE_MISFIT,
> +
> + /* util_fits_cpu() bias for an idle core. */
> + ASYM_IDLE_CORE_BIAS = -3,
> +};
> +
> /*
> * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> * the task fits. If no CPU is big enough, but there are idle ones, try to
> @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> static int
> select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> {
> + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> unsigned long task_util, util_min, util_max, best_cap = 0;
> - int fits, best_fits = 0;
> + int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> int cpu, best_cpu = -1;
> struct cpumask *cpus;
>
> @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> util_max = uclamp_eff_value(p, UCLAMP_MAX);
>
> for_each_cpu_wrap(cpu, cpus, target) {
> + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> unsigned long cpu_cap = capacity_of(cpu);
>
> if (!choose_idle_cpu(cpu, p))
> @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> fits = util_fits_cpu(task_util, util_min, util_max, cpu);
>
> /* This CPU fits with all requirements */
> - if (fits > 0)
> + if (fits > 0 && preferred_core)
> return cpu;
> /*
> * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> */
> else if (fits < 0)
> cpu_cap = get_actual_cpu_capacity(cpu);
> + /*
> + * fits > 0 implies we are not on a preferred core
> + * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> + * so the effective range becomes
> + * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> + * ASYM_IDLE_COMPLETE_MISFIT - does not fit
> + * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> + * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> + */
> + else if (fits > 0)
> + fits = ASYM_IDLE_THREAD_FITS;
> +
> + /*
> + * If we are on a preferred core, translate the range of fits
> + * of [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> + * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> + * This ensures that an idle core is always given priority over
> + * (partially) busy core.
> + *
> + * A fully fitting idle core would have returned early and hence
> + * fits > 0 for preferred_core need not be dealt with.
> + */
> + if (preferred_core)
> + fits += ASYM_IDLE_CORE_BIAS;

It might be good to add a comment stating that if the system doesn't
have SMT, prefers_idle_core is always false and preferred_core is
therefore always true.

This is okay because CPU == core in this case, but the resulting fits
value then differs from the default 0 or -1 returned by util_fits_cpu().
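
To make the non-SMT case concrete, here is a rough userspace model of
the per-CPU ranking. It is only a sketch, not kernel code: model_rank()
and MODEL_EARLY_RETURN are names made up for illustration, raw_fits
stands in for the util_fits_cpu() result, and the three bool parameters
stand in for sched_smt_active(), test_idle_cores() and is_core_idle().

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the enum in the quoted patch. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,
	ASYM_IDLE_THREAD_FITS,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
	ASYM_IDLE_COMPLETE_MISFIT,
	ASYM_IDLE_CORE_BIAS = -3,
};

/* Hypothetical marker for the "return cpu" early exit; not in the patch. */
#define MODEL_EARLY_RETURN 1

/*
 * Model of how one idle CPU is ranked inside the scan loop.
 * raw_fits: 1 = fits, 0 = complete misfit, -1 = only uclamp_min misfits.
 */
static int model_rank(int raw_fits, bool smt_active, bool has_idle_cores,
		      bool core_idle)
{
	bool prefers_idle_core = smt_active && has_idle_cores;
	bool preferred_core = !prefers_idle_core || core_idle;
	int fits = raw_fits;

	assert(raw_fits >= -1 && raw_fits <= 1);

	/* Fully fitting CPU on a preferred core: the scan stops here. */
	if (fits > 0 && preferred_core)
		return MODEL_EARLY_RETURN;
	/* Fits, but a sibling is busy while idle cores are preferred. */
	else if (fits > 0)
		fits = ASYM_IDLE_THREAD_FITS;

	/* Shift preferred cores into the [-4, -3] range. */
	if (preferred_core)
		fits += ASYM_IDLE_CORE_BIAS;

	return fits;
}
```

With smt_active == false, a complete misfit is ranked as
ASYM_IDLE_CORE_COMPLETE_MISFIT (-3) and a uclamp_min misfit as
ASYM_IDLE_CORE_UCLAMP_MISFIT (-4), instead of the plain 0 and -1.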

>
> /*
> - * First, select CPU which fits better (-1 being better than 0).
> + * First, select CPU which fits better (lower is more preferred).
> * Then, select the one with best capacity at same level.
> */
> if ((fits < best_fits) ||
> @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> }
> }
>
> + /*
> + * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]

s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

ASYM_IDLE_CORE_BIAS is an offset that moves an idle core that doesn't
fully fit into the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
ASYM_IDLE_CORE_COMPLETE_MISFIT].

Keep in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT,
so the behavior is unchanged; only the name used is misleading.
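
As a quick illustration of why the rename is purely cosmetic, the
following standalone sketch checks the equality at compile time and
compares both spellings of the final test. The two helper names are
hypothetical and exist only for this example; the enum values are
copied from the patch.

```c
#include <stdbool.h>

/* Values copied from the enum in the quoted patch. */
enum asym_fits_state {
	ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
	ASYM_IDLE_CORE_COMPLETE_MISFIT,
	ASYM_IDLE_THREAD_FITS,
	ASYM_IDLE_THREAD_UCLAMP_MISFIT,
	ASYM_IDLE_COMPLETE_MISFIT,
	ASYM_IDLE_CORE_BIAS = -3,
};

/* The bias and the upper bound of the idle-core range share a value. */
_Static_assert(ASYM_IDLE_CORE_BIAS == ASYM_IDLE_CORE_COMPLETE_MISFIT,
	       "bias and idle-core range bound are expected to match");

/* Final check as written in the patch. */
static bool clears_idle_cores_bias(int best_fits)
{
	return best_fits > ASYM_IDLE_CORE_BIAS;
}

/* Same check using the name that describes the range bound. */
static bool clears_idle_cores_named(int best_fits)
{
	return best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT;
}
```

Both helpers return the same result for every value in the enum range,
so using ASYM_IDLE_CORE_COMPLETE_MISFIT only makes the intent clearer.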

> + * range means the chosen CPU is in a fully idle SMT core. Values above
> + * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.

s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

> + *
> + * The asym-capacity wakeup path returns from select_idle_sibling()
> + * after this function and never runs select_idle_cpu(), so the usual
> + * select_idle_cpu() tail that clears idle cores must live here when the
> + * idle-core preference did not win.
> + */
> + if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)

s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/

> + set_idle_cores(target, false);
> +
> return best_cpu;
> }
>
> @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> unsigned long util_max,
> int cpu)
> {
> - if (sched_asym_cpucap_active())
> + if (sched_asym_cpucap_active()) {
> /*
> * Return true only if the cpu fully fits the task requirements
> * which include the utilization and the performance hints.
> + *
> + * When SMT is active, also require that the core has no busy
> + * siblings.
> */
> - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + return (!sched_smt_active() || is_core_idle(cpu)) &&
> + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> + }
>
> return true;
> }
> --
> 2.54.0
>