Re: [PATCH v5 5/6] sched/fair: Allow load balancing between CPUs of identical capacity

From: Christian Loehle

Date: Tue Jun 23 2026 - 03:46:17 EST


On 6/23/26 08:20, Vincent Guittot wrote:
> On Tue, 23 Jun 2026 at 01:55, Ricardo Neri
> <ricardo.neri-calderon@xxxxxxxxxxxxxxx> wrote:
>>
>> sched_balance_find_src_rq() avoids selecting a runqueue with a single
>> running task as busiest if doing so results in migrating the task to a
>> CPU with less than ~5% of extra capacity. It also unintentionally
>> prevents migrations between CPUs of identical capacity.
>>
>> When CONFIG_SCHED_CLUSTER is enabled, load should be balanced across
>> clusters of CPUs with the same capacity. Allowing migration between CPUs
>> of identical capacity is necessary to meet this goal.
>>
>> Use arch_scale_cpu_capacity() to reflect architectural capacity, excluding
>
> capacity_of() reflects not only RT and irq pressure but also thermal
> pressure or system frequency capping.
> If dst cluster is under thermal mitigation but the source cluster is
> not, we probably shouldn't spread tasks across both clusters.
> Have you considered using get_actual_cpu_capacity() instead of
> arch_scale_cpu_capacity() ?

FWIW, replacing arch_scale_cpu_capacity() with get_actual_cpu_capacity()
would make the == comparison below very unlikely to be true.
I think the patch is fine as it is. I will prepare a follow-up anyway to
make it work for our "almost equal capacity" cluster systems, and will then
also consider switching to get_actual_cpu_capacity(), since that comparison
would include a margin anyway.
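
To illustrate the point about ==, here is a rough userspace sketch, not
kernel code. All three helper names are hypothetical, the permille
scaling is only a stand-in for whatever get_actual_cpu_capacity()
really accounts for, and the 1078/1024 ratio is just meant to
approximate the ~5% margin from the patch description:

```c
#include <stdbool.h>

/* Illustrative margin check: cap1 must exceed cap2 by roughly 5%. */
static bool cap_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/* Hypothetical helper: neither capacity beats the other by the margin. */
static bool cap_almost_equal(unsigned long a, unsigned long b)
{
	return !cap_greater(a, b) && !cap_greater(b, a);
}

/*
 * Hypothetical model of a runtime capacity: the architectural capacity
 * scaled by a thermal/frequency cap expressed in permille.
 */
static unsigned long actual_cap(unsigned long arch_cap, unsigned long permille)
{
	return arch_cap * permille / 1000;
}
```

With two CPUs of architectural capacity 1024, a 3% cap on one of them
already makes the runtime values differ, so an exact match fails while
a margin-based comparison would still treat them as equal.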

>
>> runtime reductions due to side activity or thermal pressure. Guard this
>> check with the sched_cluster_active static key so that systems without
>> cluster topology are unaffected.
>>
>> Tested-by: Christian Loehle <christian.loehle@xxxxxxx>
>> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@xxxxxxxxxxxxxxx>
>> ---
>> Changes in v5:
>> * Optimized logic to identify same-arch clusters only when needed.
>> * Added Tested-by tag from Christian. Thanks!
>>
>> Changes in v4:
>> * Implemented the check for cluster with a local variable for improved
>>   readability.
>>
>> Changes in v3:
>> * Reverted the inverted capacity check; the inverted form incorrectly
>>   allows migrations to CPUs of slightly less capacity.
>> * Guarded the check for architectural capacity with the
>>   sched_cluster_active static key.
>>
>> Changes in v2:
>> * Used arch_scale_cpu_capacity() instead of capacity_of() to ignore
>>   runtime variability.
>> * Inverted the check for runtime capacity. (Christian)
>> * Reworded patch description for clarity.
>> ---
>> kernel/sched/fair.c | 9 ++++++++-
>> 1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e55eb019d2c9..f4eb55cad54d 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12992,13 +12992,20 @@ static struct rq *sched_balance_find_src_rq(struct lb_env *env,
>>  		 */
>>  		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
>>  		    nr_running == 1) {
>> +			bool same_arch_cluster = static_branch_unlikely(&sched_cluster_active) &&
>> +						 (arch_scale_cpu_capacity(env->dst_cpu) ==
>> +						  arch_scale_cpu_capacity(i));
>>  			bool smt_degraded_cap = sched_smt_active() && !is_core_idle(i);
>>
>>  			/*
>>  			 * Busy SMT siblings reduce the capacity of CPU @i. Do
>>  			 * not skip it in this case.
>> +			 *
>> +			 * CONFIG_SCHED_CLUSTER requires balancing load across clusters
>> +			 * of identical capacity. Use architectural capacity to ignore
>> +			 * runtime variability.
>>  			 */
>> -			if (!smt_degraded_cap &&
>> +			if (!smt_degraded_cap && !same_arch_cluster &&
>>  			    !capacity_greater(capacity_of(env->dst_cpu), capacity))
>>  				continue;
>>  		}
>>
>> --
>> 2.43.0
>>
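
For reference, the skip condition in the hunk above can be modelled as
a pure function. This is only a userspace sketch: struct cpu_caps and
both function names are made up for illustration, the two booleans
stand in for the sched_cluster_active static key and the SMT check, and
the 1078/1024 ratio approximates the ~5% margin from the description:

```c
#include <stdbool.h>

/* Hypothetical container for the two capacity views of one CPU. */
struct cpu_caps {
	unsigned long arch_cap;		/* architectural capacity */
	unsigned long runtime_cap;	/* capacity left after runtime pressure */
};

/* Illustrative margin check: cap1 must exceed cap2 by roughly 5%. */
static bool dst_cap_greater(unsigned long cap1, unsigned long cap2)
{
	return cap1 * 1024 > cap2 * 1078;
}

/*
 * Returns true when a source runqueue with a single running task would
 * be skipped as a candidate for busiest, following the patched check.
 */
static bool skip_single_task_src(struct cpu_caps dst, struct cpu_caps src,
				 bool cluster_active, bool smt_degraded_cap)
{
	bool same_arch_cluster = cluster_active &&
				 dst.arch_cap == src.arch_cap;

	return !smt_degraded_cap && !same_arch_cluster &&
	       !dst_cap_greater(dst.runtime_cap, src.runtime_cap);
}
```

With equal architectural capacities and a slightly reduced runtime
capacity on the destination, the source is skipped when clusters are
inactive and no longer skipped once they are active.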