[tip: sched/core] sched/fair: Don't trigger active lb if src_rq->curr is not on_rq
From: tip-bot2 for Xin Zhao
Date: Tue Jun 30 2026 - 05:07:30 EST
The following commit has been merged into the sched/core branch of tip:
Commit-ID: cdd9e37ed46ac1a80c1c9d4ec430096a0f1c419c
Gitweb: https://git.kernel.org/tip/cdd9e37ed46ac1a80c1c9d4ec430096a0f1c419c
Author: Xin Zhao <jackzxcui1989@xxxxxxx>
AuthorDate: Wed, 17 Jun 2026 15:21:50 +08:00
Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
CommitterDate: Tue, 30 Jun 2026 10:56:56 +02:00
sched/fair: Don't trigger active lb if src_rq->curr is not on_rq
Active load balancing relies on migration threads, which temporarily preempt
tasks on the source runqueue (src_rq). This preemption can negatively impact
overall system performance. The active balancing logic includes a check to
verify whether the current task (curr) on src_rq can actually run on the
destination runqueue (dst_rq). We have observed that when curr is a CFS task
and its on_rq flag is 0, active balancing almost always fails (1791 of 1839
attempts in the data below). The following table summarizes test data collected
over 300 seconds on an 18-CPU platform under a specific fillback task scenario:
fair: busiest->curr->sched_class == &fair_sched_class
on_rq: busiest->curr->on_rq
domain: span of the sched domain being balanced (hex CPU mask)
total: number of active balances triggered under the given condition
fail: number of those in which active_load_balance_cpu_stop() failed to migrate a task
               fair && !on_rq    !fair && !on_rq
cpu    domain   total    fail      total    fail
cpu0   0x00003      0       0          0       0
cpu0   0x3ffff     33      33          1       1
cpu1   0x00003      0       0          0       0
cpu1   0x3ffff     42      42          0       0
cpu2   0x0003c      4       4          0       0
cpu2   0x3ffff     12      12          0       0
cpu3   0x0003c      3       3          0       0
cpu3   0x3ffff      8       7          0       0
cpu4   0x0003c      2       2          0       0
cpu4   0x3ffff      5       4          0       0
cpu5   0x0003c      4       4          0       0
cpu5   0x3ffff      8       8          0       0
cpu6   0x003c0     60      60          0       0
cpu6   0x3ffff     28      27          0       0
cpu7   0x003c0    194     184          0       0
cpu7   0x3ffff     35      35          1       1
cpu8   0x003c0    240     228          0       0
cpu8   0x3ffff     28      28          0       0
cpu9   0x003c0      0       0          0       0
cpu9   0x3ffff     10      10          0       0
cpu10  0x03c00     52      50          0       0
cpu10  0x3ffff      0       0          0       0
cpu11  0x03c00     70      68          0       0
cpu11  0x3ffff      1       1          0       0
cpu12  0x03c00     73      72          0       0
cpu12  0x3ffff      0       0          0       0
cpu13  0x03c00     79      76          0       0
cpu13  0x3ffff      0       0          0       0
cpu14  0x3c000      0       0          0       0
cpu14  0x3ffff     57      55          1       0
cpu15  0x3c000     53      52          1       0
cpu15  0x3ffff     30      29          0       0
cpu16  0x3c000    344     341         10       6
cpu16  0x3ffff    103     100          2       1
cpu17  0x3c000    183     179          2       2
cpu17  0x3ffff     78      77          0       0
sum              1839    1791         18      11
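As a sanity check on the table, the per-row counters can be summed with a few lines of C. The array and helper names are illustrative only. The sums reproduce the "sum" row, and the aggregate failure rate for fair && !on_rq works out to about 97.4% (1791/1839), against about 61.1% (11/18) for !fair && !on_rq.

```c
#include <assert.h>
#include <stddef.h>

/* Per-row counters copied from the table, two rows per CPU
 * (lower-level domain first, then the 0x3ffff domain). */
static const unsigned int fair_total[] = {
	 0, 33,   0, 42,   4, 12,  3,  8,   2,   5,   4,  8,
	60, 28, 194, 35, 240, 28,  0, 10,  52,   0,  70,  1,
	73,  0,  79,  0,   0, 57, 53, 30, 344, 103, 183, 78,
};
static const unsigned int fair_fail[] = {
	 0, 33,   0, 42,   4, 12,  3,  7,   2,   4,   4,  8,
	60, 27, 184, 35, 228, 28,  0, 10,  50,   0,  68,  1,
	72,  0,  76,  0,   0, 55, 52, 29, 341, 100, 179, 77,
};
static const unsigned int nonfair_total[] = {
	0, 1, 0, 0, 0, 0, 0, 0,  0, 0, 0, 0,
	0, 0, 0, 1, 0, 0, 0, 0,  0, 0, 0, 0,
	0, 0, 0, 0, 0, 1, 1, 0, 10, 2, 2, 0,
};
static const unsigned int nonfair_fail[] = {
	0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
	0, 0, 0, 0, 0, 0, 0, 0, 6, 1, 2, 0,
};

#define NR_ROWS (sizeof(fair_total) / sizeof(fair_total[0]))

/* Sum one column of the table. */
static unsigned int sum_counts(const unsigned int *v, size_t n)
{
	unsigned int s = 0;

	for (size_t i = 0; i < n; i++)
		s += v[i];
	return s;
}

/* Failure rate in basis points (hundredths of a percent), rounded down,
 * using integer arithmetic only. */
static unsigned int fail_rate_bp(unsigned int fail, unsigned int total)
{
	assert(total != 0 && fail <= total);
	return fail * 10000u / total;
}
```

With these helpers, fail_rate_bp(1791, 1839) gives 9738 (97.38%, rounded down) and fail_rate_bp(11, 18) gives 6111 (61.11%).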
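In __schedule(), before curr is updated to next, pick_next_task() may end up in
sched_balance_rq() via newidle balancing, and the runqueue lock is temporarily
dropped and reacquired along that path. prev may already have been dequeued at
that point (for example, when it is going to sleep), which creates a window in
which other CPUs may observe rq->curr->on_rq as 0.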
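The ordering that opens this window can be modelled in a few lines of userspace C. This is a single-threaded sketch under stated assumptions: the types and functions (struct model_task, struct model_rq, model_schedule()) are hypothetical, the rq lock is a plain flag, and the "remote CPU" is just a function call placed where the lock would be dropped. It is not how the kernel implements __schedule().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified stand-ins; not the kernel's types. */
struct model_task {
	const char *comm;
	bool on_rq;
};

struct model_rq {
	struct model_task *curr;
	bool locked;	/* plain flag standing in for the rq lock */
};

/*
 * What another CPU would see if it acquired this runqueue's lock right now
 * and inspected ->curr, as the active-balance path does with busiest->curr.
 */
static bool remote_sees_curr_off_rq(const struct model_rq *rq)
{
	assert(!rq->locked);	/* a remote CPU can only look while the lock is free */
	return !rq->curr->on_rq;
}

/*
 * Simplified ordering of __schedule() when prev goes to sleep and the CPU
 * falls into newidle balancing. Returns what a remote CPU observes inside
 * the window where the lock is dropped.
 */
static bool model_schedule(struct model_rq *rq, struct model_task *next)
{
	bool seen;

	rq->locked = true;		/* __schedule() takes the rq lock */
	rq->curr->on_rq = false;	/* prev is dequeued: ->on_rq = 0 */

	rq->locked = false;		/* newidle balance drops the lock... */
	seen = remote_sees_curr_off_rq(rq);
	rq->locked = true;		/* ...and reacquires it */

	rq->curr = next;		/* only now does ->curr change */
	rq->locked = false;
	return seen;
}
```

The remote observer sees ->curr still pointing at prev with on_rq already 0, which is the state the patch now filters out.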
We can safely skip active balancing when src_rq->curr->on_rq == 0: curr is
already being scheduled out, and the other eligible tasks have likely already
been evaluated by detach_tasks().
We retain the affinity check on dst_rq to trigger active balancing, since such
tasks are often woken by (or wake up) tasks on src_rq that share similar
affinity constraints. Furthermore, detach_tasks() releases the runqueue lock;
any tasks awakened during this window may preempt the previous CFS task. My
testing (data not shown) indicates that active balancing succeeds in 98.4% of
cases where !fair && on_rq.
This scenario does not require a stop-work callback, but would necessitate an
additional detach/attach path. As Valentin and Vincent have already discussed,
this addition does not appear justified at this time (see [1]).
Since can_migrate_task() already checks on_cpu during the cfs_tasks traversal,
adding an on_rq check will have negligible performance overhead due to cache
locality.
There are two reasons for not combining the on_rq check with the
cpumask_test_cpu() check:
- Avoiding new scenarios that would skip the logic for resetting
balance_interval to min_interval.
- The existing check for whether the busiest CPU recently triggered active
load balancing already filters more cases than the on_rq check.
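The ordering argument can be illustrated with a userspace sketch that mirrors the patched goto no_active_balance flow. Everything here is a simplified, hypothetical model (struct lb_task, struct lb_rq, struct lb_sd and model_try_active_balance() are not kernel names): the back-off on the pinned exit ignores the LBF_ALL_PINNED, newidle and misfit special cases of the real out_one_pinned path, and the reset assumes need_active_balance() still holds whenever the stopper is kicked.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins carrying only the fields this sketch needs. */
struct lb_task { bool on_rq; bool allowed_on_this_cpu; };
struct lb_rq { struct lb_task *curr; bool active_balance; int push_cpu; };
struct lb_sd { unsigned int balance_interval, min_interval, max_interval; };

/* Returns true when the stopper work would be queued. */
static bool model_try_active_balance(struct lb_rq *busiest,
				     struct lb_sd *sd, int this_cpu)
{
	bool active_balance = false;

	/*
	 * cpumask_test_cpu() analogue: curr cannot run on this_cpu, so take
	 * the "pinned" exit, which backs off balance_interval instead of
	 * resetting it (simplified: no LBF_ALL_PINNED/newidle/misfit cases).
	 */
	if (!busiest->curr->allowed_on_this_cpu) {
		if (sd->balance_interval < sd->max_interval)
			sd->balance_interval *= 2;
		return false;
	}

	if (busiest->active_balance)	/* stopper work already pending */
		goto no_active_balance;
	if (!busiest->curr->on_rq)	/* curr is being scheduled out */
		goto no_active_balance;

	busiest->active_balance = true;
	busiest->push_cpu = this_cpu;
	active_balance = true;
no_active_balance:
	/* Fall-through path: balance_interval goes back to min_interval
	 * (assumes need_active_balance() still holds when we did kick). */
	sd->balance_interval = sd->min_interval;
	return active_balance;
}
```

Folding the on_rq test into the affinity check would send the !on_rq case down the pinned exit, so balance_interval would double instead of being reset, which is the first reason given above.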
[1]: https://lore.kernel.org/lkml/20190815145107.5318-5-valentin.schneider@xxxxxxx/
Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Reviewed-by: Valentin Schneider <vschneid@xxxxxxxxxx>
Link: https://patch.msgid.link/20260617072151.1173416-2-jackzxcui1989@xxxxxxx
---
kernel/sched/fair.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c575910..7d9ad78 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -13584,12 +13584,21 @@ more_balance:
 			 * ->active_balance_work. Once set, it's cleared
 			 * only after active load balance is finished.
 			 */
-			if (!busiest->active_balance) {
-				busiest->active_balance = 1;
-				busiest->push_cpu = this_cpu;
-				active_balance = 1;
-			}
+			if (busiest->active_balance)
+				goto no_active_balance;
+
+			/*
+			 * @busiest dropped its rq_lock in the middle of
+			 * scheduling out its ->curr task (->on_rq := 0), no
+			 * need to forcefully punt it away with active balance.
+			 */
+			if (!busiest->curr->on_rq)
+				goto no_active_balance;
+			busiest->active_balance = 1;
+			busiest->push_cpu = this_cpu;
+			active_balance = 1;
+no_active_balance:
 			preempt_disable();
 			raw_spin_rq_unlock_irqrestore(busiest, flags);
 			if (active_balance) {