[PATCH v4 1/2] sched/fair: Don't trigger active lb if src_rq->curr is CFS and not on_rq

From: Xin Zhao

Date: Tue Jun 16 2026 - 03:24:46 EST


Active balancing relies on the migration threads, which interrupt the
task running on src_rq, so it costs overall performance. It also often
fails. There is already a check that the current task on src_rq (call it
'curr') can run on dst_rq, but we have observed that even when this check
passes, the failure rate of active balancing is very high if curr is a
CFS task and its on_rq is 0. Below are test data from a fillback task
scenario run on an 18-CPU platform for 300 seconds:

total: number of cases matching
         cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr) &&
         busiest->curr->sched_class == &fair_sched_class &&
         !busiest->curr->on_rq
succ/fail: number of those matching cases in which active balancing
         succeeded/failed

cpu    domain   cpumask  total  succ  fail
cpu0   domain0  00003        0     0     0
cpu0   domain1  3ffff       32     0    32
cpu1   domain0  00003        0     0     0
cpu1   domain1  3ffff       40     0    40
cpu2   domain0  0003c        3     0     3
cpu2   domain1  3ffff        6     0     6
cpu3   domain0  0003c        3     1     2
cpu3   domain1  3ffff        3     0     3
cpu4   domain0  0003c        3     0     3
cpu4   domain1  3ffff        4     0     4
cpu5   domain0  0003c        1     0     1
cpu5   domain1  3ffff        6     0     6
cpu6   domain0  003c0       39     0    39
cpu6   domain1  3ffff       36     0    36
cpu7   domain0  003c0      213     4   209
cpu7   domain1  3ffff       24     2    22
cpu8   domain0  003c0      242    16   226
cpu8   domain1  3ffff       16     0    16
cpu9   domain0  003c0        0     0     0
cpu9   domain1  3ffff        6     1     5
cpu10  domain0  03c00       58     1    57
cpu10  domain1  3ffff        0     0     0
cpu11  domain0  03c00       54     4    50
cpu11  domain1  3ffff        1     0     1
cpu12  domain0  03c00       66     1    65
cpu12  domain1  3ffff        0     0     0
cpu13  domain0  03c00       66     1    65
cpu13  domain1  3ffff        0     0     0
cpu14  domain0  3c000        0     0     0
cpu14  domain1  3ffff       57     5    52
cpu15  domain0  3c000       15     0    15
cpu15  domain1  3ffff       35     0    35
cpu16  domain0  3c000      148     3   145
cpu16  domain1  3ffff      109     1   108
cpu17  domain0  3c000      182     2   180
cpu17  domain1  3ffff       78     1    77
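
For a rough sense of scale, the per-domain counters above can be summed. The following is a minimal user-space sketch, not part of the patch: the array only restates the total and succ columns of the table, and struct lb_sample, sum_total(), sum_succ() and fail_percent() are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for one table row: matching cases and successes. */
struct lb_sample {
	unsigned int total;
	unsigned int succ;
};

/* total/succ columns copied from the table: domain0 then domain1 per CPU. */
static const struct lb_sample samples[] = {
	{   0,  0 }, {  32, 0 },	/* cpu0  */
	{   0,  0 }, {  40, 0 },	/* cpu1  */
	{   3,  0 }, {   6, 0 },	/* cpu2  */
	{   3,  1 }, {   3, 0 },	/* cpu3  */
	{   3,  0 }, {   4, 0 },	/* cpu4  */
	{   1,  0 }, {   6, 0 },	/* cpu5  */
	{  39,  0 }, {  36, 0 },	/* cpu6  */
	{ 213,  4 }, {  24, 2 },	/* cpu7  */
	{ 242, 16 }, {  16, 0 },	/* cpu8  */
	{   0,  0 }, {   6, 1 },	/* cpu9  */
	{  58,  1 }, {   0, 0 },	/* cpu10 */
	{  54,  4 }, {   1, 0 },	/* cpu11 */
	{  66,  1 }, {   0, 0 },	/* cpu12 */
	{  66,  1 }, {   0, 0 },	/* cpu13 */
	{   0,  0 }, {  57, 5 },	/* cpu14 */
	{  15,  0 }, {  35, 0 },	/* cpu15 */
	{ 148,  3 }, { 109, 1 },	/* cpu16 */
	{ 182,  2 }, {  78, 1 },	/* cpu17 */
};

#define NR_SAMPLES (sizeof(samples) / sizeof(samples[0]))

/* Sum of the "total" column over all CPUs and domains. */
unsigned int sum_total(void)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < NR_SAMPLES; i++)
		sum += samples[i].total;
	return sum;
}

/* Sum of the "succ" column over all CPUs and domains. */
unsigned int sum_succ(void)
{
	unsigned int sum = 0;

	for (size_t i = 0; i < NR_SAMPLES; i++)
		sum += samples[i].succ;
	return sum;
}

/* Share of matching cases that failed, in whole percent (rounded down). */
unsigned int fail_percent(void)
{
	unsigned int total = sum_total();

	return total ? (total - sum_succ()) * 100 / total : 0;
}
```

Summed this way, the table gives 1546 matching cases, of which 43 succeeded and 1503 failed, so roughly 97% of these active balance attempts failed.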

In __schedule(), try_to_block_task() sets curr->on_rq to 0 before
pick_next_task() runs. During pick_next_task(), sched_balance_rq() is
called, and it unlocks and then re-locks the rq. rq->curr is only set to
next after pick_next_task() returns, so these rq lock "holes" let other
CPUs take the rq lock and see rq->curr->on_rq as 0.
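
The ordering described above can be replayed in a single-threaded toy model. This is only a sketch under simplifying assumptions: struct toy_task, struct toy_rq and all toy_*() helpers are hypothetical stand-ins rather than kernel code, and the "remote CPU" is modeled as a plain read at the point where the rq lock would be dropped.

```c
#include <assert.h>

/* Toy stand-ins for the kernel structures; only the fields used here. */
struct toy_task {
	int on_rq;
};

struct toy_rq {
	struct toy_task *curr;
};

/* What a remote CPU would observe at two points in time. */
struct toy_observation {
	int in_hole;		/* read while the rq lock is dropped */
	int after_switch;	/* read after rq->curr points to next */
};

/* Models try_to_block_task() clearing on_rq of the blocking task. */
static void toy_block_curr(struct toy_rq *rq)
{
	rq->curr->on_rq = 0;
}

/* Models a remote CPU reading rq->curr->on_rq under the rq lock. */
static int toy_remote_read(const struct toy_rq *rq)
{
	return rq->curr->on_rq;
}

/* Replays: block curr, lock hole during the pick, then switch to next. */
struct toy_observation toy_replay(void)
{
	struct toy_task prev = { .on_rq = 1 };
	struct toy_task next = { .on_rq = 1 };
	struct toy_rq rq = { .curr = &prev };
	struct toy_observation obs;

	toy_block_curr(&rq);			/* prev goes to sleep */
	obs.in_hole = toy_remote_read(&rq);	/* rq->curr is still prev */
	rq.curr = &next;			/* curr is set to next */
	obs.after_switch = toy_remote_read(&rq);

	return obs;
}
```

In this model the read inside the hole returns 0 and the read after the switch returns 1, which matches the window in which sched_balance_rq() on another CPU can find a CFS curr with on_rq == 0.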

There is no need to perform active balancing when src_rq->curr is a CFS
task whose on_rq is 0, because the other CFS tasks on src_rq have already
been checked just before. When src_rq->curr is a non-CFS task, we keep
using the affinity check against dst_rq to trigger active balancing,
because such a task is likely to wake, or be woken by, a CFS task on
src_rq that has similar affinity and can be migrated.
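
The rule described above, combined with the existing ->active_balance guard, can be written as a small predicate. This is a user-space sketch for illustration only: toy_should_start_active_balance() and its boolean parameters are hypothetical and stand in for busiest->active_balance, busiest->curr->sched_class == &fair_sched_class and busiest->curr->on_rq.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Returns true when active balancing would be kicked off for the busiest
 * runqueue under the rule described in this commit message.
 */
bool toy_should_start_active_balance(bool active_balance,
				     bool curr_is_fair,
				     bool curr_on_rq)
{
	/* An active balance is already in flight on this runqueue. */
	if (active_balance)
		return false;

	/* CFS curr is in the middle of being scheduled out: skip. */
	if (curr_is_fair && !curr_on_rq)
		return false;

	/* Includes non-CFS curr, where only the affinity check applies. */
	return true;
}
```

Only the combination "CFS task and on_rq == 0" is filtered out. A non-CFS curr still leads to active balancing regardless of its on_rq value, as does a CFS curr that is still queued.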

There are two reasons for not checking sched_class and on_rq of
busiest->curr together with the cpumask_test_cpu() check:
1. The patch should not introduce new cases that skip the logic for
resetting balance_interval to min_interval.
2. The check for whether active balance has just been triggered on the
busiest CPU filters out slightly more cases than the sched_class and
on_rq check.

Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
---

Changes in v4:
- Add a comment explaining why busiest->curr->on_rq needs to be checked,
as suggested by Valentin Schneider.
- Restructure the patch and add one more label to make the code easier
to read,
as suggested by Valentin Schneider.

Changes in v3:
- Take the cost of the sched_class and on_rq check into account,
as suggested by Aiqun(Maira) Yu.
Move this check after the check for whether active balance has just been
triggered on the busiest CPU.
- Link to v3: https://lore.kernel.org/all/20260615053809.3587677-2-jackzxcui1989@xxxxxxx/

Changes in v2:
- Explain in the commit log why rq->curr->on_rq can be seen as zero while
the rq lock is held,
as suggested by Valentin Schneider.
- Link to v2: https://lore.kernel.org/all/20260613073228.1951105-1-jackzxcui1989@xxxxxxx/

v1:
- Link to v1: https://lore.kernel.org/all/20260603125938.1938115-1-jackzxcui1989@xxxxxxx/
---
kernel/sched/fair.c | 20 +++++++++++++++-----
1 file changed, 15 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5819c489..4391b6e5b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13436,12 +13436,22 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 			 * ->active_balance_work. Once set, it's cleared
 			 * only after active load balance is finished.
 			 */
-			if (!busiest->active_balance) {
-				busiest->active_balance = 1;
-				busiest->push_cpu = this_cpu;
-				active_balance = 1;
-			}
+			if (busiest->active_balance)
+				goto no_active_balance;

+			/*
+			 * @busiest dropped its rq_lock in the middle of
+			 * scheduling out its ->curr task (->on_rq := 0), no
+			 * need to forcefully punt it away with active balance.
+			 */
+			if ((busiest->curr->sched_class == &fair_sched_class) &&
+			    !busiest->curr->on_rq)
+				goto no_active_balance;
+
+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+no_active_balance:
 			preempt_disable();
 			raw_spin_rq_unlock_irqrestore(busiest, flags);
 			if (active_balance) {
--
2.34.1