[PATCH v5 1/2] sched/fair: Don't trigger active lb if src_rq->curr is not on_rq

From: Xin Zhao

Date: Wed Jun 17 2026 - 03:23:14 EST

Active balancing relies on the migration threads, which interrupt the
task running on src_rq, so it has some impact on overall performance.
Active balancing also fails often, even though there is a check that the
current task (call it 'curr') on src_rq can run on dst_rq. We have
observed that, even when that check passes, the failure rate of active
balancing is very high if curr is a CFS task and its on_rq is 0. Below
are test data from a fillback task scenario run on an 18-CPU platform for
300 seconds:

fair:  busiest->curr->sched_class == &fair_sched_class
on_rq: busiest->curr->on_rq
total: number of active balances triggered for the corresponding type
fail:  active balances that failed to migrate a task in
       active_load_balance_cpu_stop()

                  fair && !on_rq   !fair && !on_rq
cpu     domain      total   fail      total   fail
cpu0    0x00003         0      0          0      0
cpu0    0x3ffff        33     33          1      1
cpu1    0x00003         0      0          0      0
cpu1    0x3ffff        42     42          0      0
cpu2    0x0003c         4      4          0      0
cpu2    0x3ffff        12     12          0      0
cpu3    0x0003c         3      3          0      0
cpu3    0x3ffff         8      7          0      0
cpu4    0x0003c         2      2          0      0
cpu4    0x3ffff         5      4          0      0
cpu5    0x0003c         4      4          0      0
cpu5    0x3ffff         8      8          0      0
cpu6    0x003c0        60     60          0      0
cpu6    0x3ffff        28     27          0      0
cpu7    0x003c0       194    184          0      0
cpu7    0x3ffff        35     35          1      1
cpu8    0x003c0       240    228          0      0
cpu8    0x3ffff        28     28          0      0
cpu9    0x003c0         0      0          0      0
cpu9    0x3ffff        10     10          0      0
cpu10   0x03c00        52     50          0      0
cpu10   0x3ffff         0      0          0      0
cpu11   0x03c00        70     68          0      0
cpu11   0x3ffff         1      1          0      0
cpu12   0x03c00        73     72          0      0
cpu12   0x3ffff         0      0          0      0
cpu13   0x03c00        79     76          0      0
cpu13   0x3ffff         0      0          0      0
cpu14   0x3c000         0      0          0      0
cpu14   0x3ffff        57     55          1      0
cpu15   0x3c000        53     52          1      0
cpu15   0x3ffff        30     29          0      0
cpu16   0x3c000       344    341         10      6
cpu16   0x3ffff       103    100          2      1
cpu17   0x3c000       183    179          2      2
cpu17   0x3ffff        78     77          0      0
sum                  1839   1791         18     11
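
For reference, the "sum" row works out to roughly a 97.4% failure rate
for fair && !on_rq (1791 of 1839) and roughly 61.1% for !fair && !on_rq
(11 of 18). The helper below is only an illustrative sketch of that
arithmetic; fail_rate_bp() is a hypothetical name, not a kernel function,
and it truncates rather than rounds.

```c
#include <assert.h>

/*
 * Hypothetical helper (not part of the kernel): failure rate in basis
 * points (1/100 of a percent), truncated towards zero.
 */
static unsigned int fail_rate_bp(unsigned int total, unsigned int fail)
{
	assert(fail <= total);
	if (total == 0)
		return 0;	/* no attempts, so no failures to report */
	return (fail * 10000u) / total;
}
```

With the numbers above, fail_rate_bp(1839, 1791) and fail_rate_bp(18, 11)
give the two rates in basis points.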

In __schedule(), pick_next_task() runs before rq->curr is set to next,
and sched_balance_rq() is called during its execution. This path unlocks
and then re-locks the rq, creating "holes" during which other CPUs may
see rq->curr->on_rq as zero: try_to_block_task() has already set
curr->on_rq to 0, but rq->curr has not yet been assigned to next.
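
The ordering described above can be modelled in userspace as follows.
This is a heavily simplified, single-threaded sketch: struct task_model,
struct rq_model and the model_*() helpers are hypothetical stand-ins and
do not correspond to real kernel code or locking primitives.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for task_struct and rq (assumptions). */
struct task_model {
	int on_rq;
};

struct rq_model {
	struct task_model *curr;
	bool locked;		/* models the rq lock being held */
};

/* Step 1: curr blocks; models the on_rq update in try_to_block_task(). */
static void model_block_curr(struct rq_model *rq)
{
	assert(rq->locked);
	rq->curr->on_rq = 0;
}

/* Step 2: balancing during pick_next_task() drops the rq lock. */
static void model_open_hole(struct rq_model *rq)
{
	rq->locked = false;
}

/* What a remote CPU reads if it takes the rq lock inside the hole. */
static int model_remote_observe(struct rq_model *rq)
{
	int seen;

	assert(!rq->locked);	/* only possible while the lock is free */
	rq->locked = true;
	seen = rq->curr->on_rq;	/* rq->curr is still the blocked task */
	rq->locked = false;
	return seen;
}

/* Step 3: the lock is retaken and rq->curr finally becomes next. */
static void model_close_hole_and_switch(struct rq_model *rq,
					struct task_model *next)
{
	rq->locked = true;
	rq->curr = next;
}
```

Calling the helpers in the order block, open hole, remote observe returns
0 from model_remote_observe(), which is the window this patch targets.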

We do not need to perform active balancing when src_rq->curr is a CFS
task whose on_rq is 0, as the other CFS tasks have probably been checked
just before. For cases where src_rq->curr is a non-CFS task, we retain the
affinity check against dst_rq to trigger active balancing, because such a
task is likely to wake, or be woken by, a CFS task on src_rq that has
similar affinity characteristics and can be migrated. Also, the rq lock is
released after detach_tasks() executes, and tasks on the rq that were
woken during detach_tasks() may preempt the previous CFS task. In my
tests (not shown above), the success rate of active balancing under
!fair && on_rq is 98.4%. This scenario does not require the stop work,
but avoiding it would mean adding another path to detach and attach the
task(s). That does not seem worthwhile; Valentin and Vincent have already
discussed it, see [1].

Additionally, the sched_class field is a bit far from on_cpu in
task_struct. The preceding traversal of cfs_tasks checks on_cpu in
can_migrate_task(), so thanks to cache locality the additional on_rq
check does not cost many CPU cycles.

Two reasons for not checking sched_class and on_rq of busiest->curr
together with the cpumask_test_cpu() check:
1. The patch should not introduce new cases that skip the logic resetting
   balance_interval to min_interval.
2. The check of whether active balance has just been triggered on the
   busiest CPU filters out slightly more cases than the sched_class and
   on_rq check.
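
The resulting order of checks can be sketched as a standalone function.
This is a userspace model for illustration only: struct busiest_model and
model_try_arm_active_balance() are hypothetical names, and the fields
merely mirror busiest->active_balance, busiest->push_cpu and
busiest->curr->on_rq.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the busiest rq state used by the checks. */
struct busiest_model {
	bool active_balance;	/* models busiest->active_balance */
	int push_cpu;		/* models busiest->push_cpu */
	int curr_on_rq;		/* models busiest->curr->on_rq */
};

/* Returns true if this call armed an active balance towards this_cpu. */
static bool model_try_arm_active_balance(struct busiest_model *b,
					 int this_cpu)
{
	/* First check: an active balance is already pending. */
	if (b->active_balance)
		return false;

	/* Second check: curr is being scheduled out, nothing to punt. */
	if (!b->curr_on_rq)
		return false;

	b->active_balance = true;
	b->push_cpu = this_cpu;
	return true;
}
```

In this model the on_rq test is only reached when no active balance is
pending, matching the ordering argued for in point 2.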

[1]: https://lore.kernel.org/lkml/20190815145107.5318-5-valentin.schneider@xxxxxxx/

Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
---

Changes in v5:
- Get rid of the 'busiest->curr->sched_class == &fair_sched_class' check,
  as suggested by Valentin Schneider.
- Re-test the new condition; adjust and enrich the related commit log.

Changes in v4:
- Add a comment explaining why busiest->curr->on_rq needs to be checked,
  as suggested by Valentin Schneider.
- Restructure the patch code and add one more label to make the code
  easier to read,
  as suggested by Valentin Schneider.
- Link to v4: https://lore.kernel.org/all/20260616071859.343253-2-jackzxcui1989@xxxxxxx/

Changes in v3:
- Consider the cost of the sched_class and on_rq check,
  as suggested by Aiqun(Maira) Yu.
  Move the check after the check of whether active balance has just been
  triggered on the busiest CPU.
- Link to v3: https://lore.kernel.org/all/20260615053809.3587677-2-jackzxcui1989@xxxxxxx/

Changes in v2:
- Add the reason in the commit log why we can see zero rq->curr->on_rq
  while holding the rq lock,
  as suggested by Valentin Schneider.
- Link to v2: https://lore.kernel.org/all/20260613073228.1951105-1-jackzxcui1989@xxxxxxx/

v1:
- Link to v1: https://lore.kernel.org/all/20260603125938.1938115-1-jackzxcui1989@xxxxxxx/
---
kernel/sched/fair.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b5819c489..2b9653623 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13436,12 +13436,21 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
 			 * ->active_balance_work. Once set, it's cleared
 			 * only after active load balance is finished.
 			 */
-			if (!busiest->active_balance) {
-				busiest->active_balance = 1;
-				busiest->push_cpu = this_cpu;
-				active_balance = 1;
-			}
+			if (busiest->active_balance)
+				goto no_active_balance;
+
+			/*
+			 * @busiest dropped its rq_lock in the middle of
+			 * scheduling out its ->curr task (->on_rq := 0), no
+			 * need to forcefully punt it away with active balance.
+			 */
+			if (!busiest->curr->on_rq)
+				goto no_active_balance;

+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+no_active_balance:
 			preempt_disable();
 			raw_spin_rq_unlock_irqrestore(busiest, flags);
 			if (active_balance) {
--
2.34.1