Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due

From: Shrikanth Hegde

Date: Wed Nov 12 2025 - 23:25:35 EST

On 11/13/25 2:40 AM, Tim Chen wrote:

On Wed, 2025-11-12 at 12:21 +0100, Peter Zijlstra wrote:

On Wed, Nov 12, 2025 at 04:39:43PM +0530, Shrikanth Hegde wrote:

So perhaps this is the better option -- or did I overlook something with
should_we_balance? It doesn't look like that will make a different
decision on the retry.

I think in newidle balance, these checks are there in swb to bail of load balance.
redo logic catches it right?

Urgh, my brain still thinks we're not serializing on newidle. Perhaps I
should make this 2 patches, one moving the serializing and one adding it
to newidle.

env->dst_rq lock is taken only in attach_tasks, meanwhile, if the wakeup happened,
pending would be set. is irq enabled or remote CPU can set ttwu_pending on this rq?

if (env->idle == CPU_NEWLY_IDLE) {
if (env->dst_rq->nr_running > 0 || env->dst_rq->ttwu_pending)
return 0;
return 1;
}

Right, that could get tickled.

How about something like the following on top of v4 patch?
This will avoid releasing the lock and take care of the NEWLY_IDLE case.

Tim

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 43c5ec039633..26179f4b77f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -11772,14 +11772,13 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
.fbq_type = all,
.tasks = LIST_HEAD_INIT(env.tasks),
};
- bool need_unlock;
+ bool need_unlock = false;
cpumask_and(cpus, sched_domain_span(sd), cpu_active_mask);
schedstat_inc(sd->lb_count[idle]);
redo:
- need_unlock = false;
if (!should_we_balance(&env)) {
*continue_balancing = 0;
goto out_balanced;
@@ -11916,9 +11915,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
if (!cpumask_subset(cpus, env.dst_grpmask)) {
env.loop = 0;
env.loop_break = SCHED_NR_MIGRATE_BREAK;
- if (need_unlock)
- atomic_set_release(&sched_balance_running, 0);
-
+ if (env->idle == CPU_NEWLY_IDLE &&
+ (env->dst_running > 0 || env->dst_rq->ttwu_pending))
+ goto out;

IIUC, we come here, it means busiest cpu was found, but due to
affinity restrictions none of the tasks can come to this cpu.

So a redo is done excluding that busiest cpu if there are cpus other
than the group_mask of this cpu. So doing a redo does make sense specially
for newidle.

So doing bailing out might be wrong.

goto redo;
}
goto out_all_pinned;