Re: [PATCH 3/5] sched/fair: Prefer fully-idle SMT cores in asym-capacity idle selection
From: Andrea Righi
Date: Wed May 06 2026 - 14:15:54 EST
Hi Vincent,
On Wed, May 06, 2026 at 12:29:10PM +0200, Vincent Guittot wrote:
> On Tue, 28 Apr 2026 at 16:44, Andrea Righi <arighi@xxxxxxxxxx> wrote:
> >
> > On systems with asymmetric CPU capacity (e.g., ACPI/CPPC reporting
> > different per-core frequencies), the wakeup path uses
> > select_idle_capacity() and prioritizes idle CPUs with higher capacity
> > for better task placement. However, when those CPUs belong to SMT cores,
> > their effective capacity can be much lower than the nominal value if
> > the sibling thread is busy: SMT siblings compete for shared resources,
> > so a "high capacity" CPU that is idle while its sibling is busy does
> > not deliver its full capacity. This reduction cannot be modeled by the
> > static capacity value alone.
> >
> > Introduce SMT awareness in the asym-capacity idle selection policy: when
> > SMT is active, always prefer fully-idle SMT cores over partially-idle
> > ones.
> >
> > Prioritizing fully-idle SMT cores yields better task placement because
> > the effective capacity of partially-idle SMT cores is reduced; always
> > preferring fully-idle cores when available leads to more accurate
> > capacity-based placement on task wakeup.
> >
> > On an SMT system with asymmetric CPU capacities, SMT-aware idle
> > selection has been shown to improve throughput by around 15-18% for
> > CPU-bound workloads running a number of tasks equal to the number of
> > SMT cores.
> >
> > Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> > Cc: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
> > Cc: Christian Loehle <christian.loehle@xxxxxxx>
> > Cc: Koba Ko <kobak@xxxxxxxxxx>
> > Reviewed-by: K Prateek Nayak <kprateek.nayak@xxxxxxx>
> > Reported-by: Felix Abecassis <fabecassis@xxxxxxxxxx>
> > Signed-off-by: Andrea Righi <arighi@xxxxxxxxxx>
> > ---
> > kernel/sched/fair.c | 70 +++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 65 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index bbdf537f61154..6a7e4943804b5 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7989,6 +7989,22 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > return idle_cpu;
> > }
> >
> > +/*
> > + * Idle-capacity scan ranks transformed util_fits_cpu() outcomes; lower values
> > + * are more preferred (see select_idle_capacity()).
> > + */
> > +enum asym_fits_state {
> > + /* In descending order of preference */
> > + ASYM_IDLE_CORE_UCLAMP_MISFIT = -4,
> > + ASYM_IDLE_CORE_COMPLETE_MISFIT,
> > + ASYM_IDLE_THREAD_FITS,
> > + ASYM_IDLE_THREAD_UCLAMP_MISFIT,
> > + ASYM_IDLE_COMPLETE_MISFIT,
> > +
> > + /* util_fits_cpu() bias for an idle core. */
> > + ASYM_IDLE_CORE_BIAS = -3,
> > +};
> > +
> > /*
> > * Scan the asym_capacity domain for idle CPUs; pick the first idle one on which
> > * the task fits. If no CPU is big enough, but there are idle ones, try to
> > @@ -7997,8 +8013,9 @@ static int select_idle_cpu(struct task_struct *p, struct sched_domain *sd, bool
> > static int
> > select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > {
> > + bool prefers_idle_core = sched_smt_active() && test_idle_cores(target);
> > unsigned long task_util, util_min, util_max, best_cap = 0;
> > - int fits, best_fits = 0;
> > + int fits, best_fits = ASYM_IDLE_COMPLETE_MISFIT;
> > int cpu, best_cpu = -1;
> > struct cpumask *cpus;
> >
> > @@ -8010,6 +8027,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > util_max = uclamp_eff_value(p, UCLAMP_MAX);
> >
> > for_each_cpu_wrap(cpu, cpus, target) {
> > + bool preferred_core = !prefers_idle_core || is_core_idle(cpu);
> > unsigned long cpu_cap = capacity_of(cpu);
> >
> > if (!choose_idle_cpu(cpu, p))
> > @@ -8018,7 +8036,7 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > fits = util_fits_cpu(task_util, util_min, util_max, cpu);
> >
> > /* This CPU fits with all requirements */
> > - if (fits > 0)
> > + if (fits > 0 && preferred_core)
> > return cpu;
> > /*
> > * Only the min performance hint (i.e. uclamp_min) doesn't fit.
> > @@ -8026,9 +8044,33 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > */
> > else if (fits < 0)
> > cpu_cap = get_actual_cpu_capacity(cpu);
> > + /*
> > + * fits > 0 implies we are not on a preferred core
> > + * but the util fits CPU capacity. Set fits to ASYM_IDLE_THREAD_FITS
> > + * so the effective range becomes
> > + * [ASYM_IDLE_THREAD_FITS, ASYM_IDLE_COMPLETE_MISFIT], where:
> > + * ASYM_IDLE_COMPLETE_MISFIT - does not fit
> > + * ASYM_IDLE_THREAD_UCLAMP_MISFIT - fits with the exception of UCLAMP_MIN
> > + * ASYM_IDLE_THREAD_FITS - fits with the exception of preferred_core
> > + */
> > + else if (fits > 0)
> > + fits = ASYM_IDLE_THREAD_FITS;
> > +
> > + /*
> > + * If we are on a preferred core, translate fits from the range
> > + * [ASYM_IDLE_THREAD_UCLAMP_MISFIT, ASYM_IDLE_COMPLETE_MISFIT] to
> > + * [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT].
> > + * This ensures that an idle core is always given priority over a
> > + * (partially) busy core.
> > + *
> > + * A fully fitting idle core would have returned early and hence
> > + * fits > 0 for preferred_core need not be dealt with.
> > + */
> > + if (preferred_core)
> > + fits += ASYM_IDLE_CORE_BIAS;
>
> It might be good to add a comment stating that if the system doesn't
> have SMT, prefers_idle_core is always false and preferred_core is
> always true.
>
> This is okay because CPU == core in that case, but the resulting value
> differs from the default 0 or -1 returned by util_fits_cpu().
Ack.
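Something along these lines maybe (just a draft, wording can be adjusted):

	/*
	 * Note: without SMT, sched_smt_active() is false, so
	 * prefers_idle_core is always false and preferred_core is always
	 * true: CPU == core in that case, the bias is applied uniformly
	 * to all candidates and their relative ranking is preserved, even
	 * though the values differ from the 0 / -1 returned by
	 * util_fits_cpu().
	 */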
>
> >
> > /*
> > - * First, select CPU which fits better (-1 being better than 0).
> > + * First, select CPU which fits better (lower is more preferred).
> > * Then, select the one with best capacity at same level.
> > */
> > if ((fits < best_fits) ||
> > @@ -8039,6 +8081,19 @@ select_idle_capacity(struct task_struct *p, struct sched_domain *sd, int target)
> > }
> > }
> >
> > + /*
> > + * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_BIAS]
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
>
> ASYM_IDLE_CORE_BIAS is an offset to move an idle core that doesn't
> fully fit into the preferred range [ASYM_IDLE_CORE_UCLAMP_MISFIT,
> ASYM_IDLE_CORE_COMPLETE_MISFIT].
>
> Keeping in mind that ASYM_IDLE_CORE_BIAS == -3 == ASYM_IDLE_CORE_COMPLETE_MISFIT.
Ah yes, using ASYM_IDLE_CORE_BIAS is just confusing, we should definitely use
[ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]. Will fix this.
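To recap, with the bias applied the overall ranking (lower == more
preferred) becomes:

	ASYM_IDLE_CORE_UCLAMP_MISFIT   == -4  (idle core, uclamp_min misfit)
	ASYM_IDLE_CORE_COMPLETE_MISFIT == -3  (idle core, complete misfit)
	ASYM_IDLE_THREAD_FITS          == -2  (idle thread, util fits)
	ASYM_IDLE_THREAD_UCLAMP_MISFIT == -1  (idle thread, uclamp_min misfit)
	ASYM_IDLE_COMPLETE_MISFIT      ==  0  (complete misfit)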
>
> > + * range means the chosen CPU is in a fully idle SMT core. Values above
> > + * ASYM_IDLE_CORE_BIAS mean we never ranked such a CPU best.
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
Ack.
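Will update the first part of the comment to read:

	/*
	 * A value in the [ASYM_IDLE_CORE_UCLAMP_MISFIT, ASYM_IDLE_CORE_COMPLETE_MISFIT]
	 * range means the chosen CPU is in a fully idle SMT core. Values above
	 * ASYM_IDLE_CORE_COMPLETE_MISFIT mean we never ranked such a CPU best.
	 */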
>
> > + *
> > + * The asym-capacity wakeup path returns from select_idle_sibling()
> > + * after this function and never runs select_idle_cpu(), so the usual
> > + * select_idle_cpu() tail that clears idle cores must live here when the
> > + * idle-core preference did not win.
> > + */
> > + if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_BIAS)
>
> s/ASYM_IDLE_CORE_BIAS/ASYM_IDLE_CORE_COMPLETE_MISFIT/
Ack.
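And the check becomes (no functional change, since both evaluate to -3):

	if (prefers_idle_core && best_fits > ASYM_IDLE_CORE_COMPLETE_MISFIT)
		set_idle_cores(target, false);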
>
> > + set_idle_cores(target, false);
> > +
> > return best_cpu;
> > }
> >
> > @@ -8047,12 +8102,17 @@ static inline bool asym_fits_cpu(unsigned long util,
> > unsigned long util_max,
> > int cpu)
> > {
> > - if (sched_asym_cpucap_active())
> > + if (sched_asym_cpucap_active()) {
> > /*
> > * Return true only if the cpu fully fits the task requirements
> > * which include the utilization and the performance hints.
> > + *
> > + * When SMT is active, also require that the core has no busy
> > + * siblings.
> > */
> > - return (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + return (!sched_smt_active() || is_core_idle(cpu)) &&
> > + (util_fits_cpu(util, util_min, util_max, cpu) > 0);
> > + }
> >
> > return true;
> > }
> > --
> > 2.54.0
> >
Thanks,
-Andrea