Re: [PATCH v3 1/2] sched/fair: Don't trigger active lb if src_rq->curr is CFS and not on_rq

From: Phil Auld

Date: Mon Jun 15 2026 - 09:58:44 EST

Hi,

On Mon, Jun 15, 2026 at 03:39:22PM +0200 Valentin Schneider wrote:
> On 15/06/26 13:38, Xin Zhao wrote:
> > Active balancing needs the help by migration threads which will interrupt
> > task on src_rq. It has a certain impact on overall performance. Active
> > balancing often fails, there is a check to determine whether the current
> > task(say it 'curr') on src_rq can run on dst_rq. We have observed that
> > even that, if curr is a CFS task and on_rq is 0, the failure rate of
> > active balancing is very high. Below are the test data from a certain
> > fillback task scenario executed on a platform with 18 CPUs over 300
> > seconds:
> >
> > total: the total count of cases that match
> > cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr) &&
> > busiest->curr->sched_class == &fair_sched_class &&
> > !busiest->curr->on_rq
> > succ/fail: the active balance success/fail cases that match
> > cpumask_......->on_rq
> >
> > total succ fail
> > cpu0 domain0 00003 0 0 0
> > cpu0 domain1 3ffff 32 0 32
> > cpu1 domain0 00003 0 0 0
> > cpu1 domain1 3ffff 40 0 40
> > cpu2 domain0 0003c 3 0 3
> > cpu2 domain1 3ffff 6 0 6
> > cpu3 domain0 0003c 3 1 2
> > cpu3 domain1 3ffff 3 0 3
> > cpu4 domain0 0003c 3 0 3
> > cpu4 domain1 3ffff 4 0 4
> > cpu5 domain0 0003c 1 0 1
> > cpu5 domain1 3ffff 6 0 6
> > cpu6 domain0 003c0 39 0 39
> > cpu6 domain1 3ffff 36 0 36
> > cpu7 domain0 003c0 213 4 209
> > cpu7 domain1 3ffff 24 2 22
> > cpu8 domain0 003c0 242 16 226
> > cpu8 domain1 3ffff 16 0 16
> > cpu9 domain0 003c0 0 0 0
> > cpu9 domain1 3ffff 6 1 5
> > cpu10 domain0 03c00 58 1 57
> > cpu10 domain1 3ffff 0 0 0
> > cpu11 domain0 03c00 54 4 50
> > cpu11 domain1 3ffff 1 0 1
> > cpu12 domain0 03c00 66 1 65
> > cpu12 domain1 3ffff 0 0 0
> > cpu13 domain0 03c00 66 1 65
> > cpu13 domain1 3ffff 0 0 0
> > cpu14 domain0 3c000 0 0 0
> > cpu14 domain1 3ffff 57 5 52
> > cpu15 domain0 3c000 15 0 15
> > cpu15 domain1 3ffff 35 0 35
> > cpu16 domain0 3c000 148 3 145
> > cpu16 domain1 3ffff 109 1 108
> > cpu17 domain0 3c000 182 2 180
> > cpu17 domain1 3ffff 78 1 77
> >
> > In __schedule(), before setting curr to next, during the execution of
> > pick_next_task(), sched_balance_rq() is called. It will unlock and then
> > re-lock the rq, creating "holes" during which other CPUs may see zero
> > rq->curr->on_rq. try_to_block_task() sets curr->on_rq to 0, and during the
> > rq lock "hole" in pick_next_task(), rq->curr has not yet been assigned to
> > next, resulting in curr->on_rq being seen as 0.
> >
>
> Aaah, it's a load balance shaped hole... Urgh, okay I see it now, thanks.
>
> > We do not need to perform active balancing when src_rq->curr is CFS task
> > but on_rq is 0, as other CFS tasks have been already checked just before.
> > For cases where src_rq->curr is a non-CFS task, we retain the affinity
> > check for dst_rq to trigger active balancing because such task is likely
> > to wake-up or woken-by src_rq CFS task which has similar affinity
> > characteristics to migrate.
> >
> > Two reasons why not check sched_class and on_rq of busiest->curr with the
> > cpumask_test_cpu() check:
> > 1. Let the PATCH not introduce new cases that skip logic for resetting
> > balance_interval to min_interval.
> > 2. The check of whether busiest cpu has been just triggered active balance
> > filters a bit more cases than the check of sched_class and on_rq.
> >
> > Signed-off-by: Xin Zhao <jackzxcui1989@xxxxxxx>
> > ---
> >
> > Change in v3:
> > - Consider the cost by sched_class and on_rq check,
> > as suggested by Aiqun(Maira) Yu.
> > Move the check after the check of whether busiest cpu has been just
> > triggered active balance.
> >
> > Change in v2:
> > - Add reason in the commit log why we can see zero rq->curr->on_rq when we
> > hold rq lock,
> > as suggested by Valentin Schneider.
> > - Link to v2: https://lore.kernel.org/all/20260613073228.1951105-1-jackzxcui1989@xxxxxxx/
> >
> > v1:
> > - Link to v1: https://lore.kernel.org/all/20260603125938.1938115-1-jackzxcui1989@xxxxxxx/
> > ---
> > kernel/sched/fair.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index b5819c489..1c5043629 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -13436,7 +13436,9 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> > * ->active_balance_work. Once set, it's cleared
> > * only after active load balance is finished.
> > */
> > - if (!busiest->active_balance) {
> > + if (!busiest->active_balance &&
> > + !(busiest->curr->sched_class == &fair_sched_class &&
> > + !busiest->curr->on_rq)) {
>
> This should be commented; I restructured this a little, it's one more label
> but it reads better to me, what do you think?

This below does look better to me as well, fwiw.

Cheers,
Phil

> ---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d78467ec6ee13..a0ac31a3be988 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -13482,12 +13482,22 @@ static int sched_balance_rq(int this_cpu, struct rq *this_rq,
> * ->active_balance_work. Once set, it's cleared
> * only after active load balance is finished.
> */
> - if (!busiest->active_balance) {
> - busiest->active_balance = 1;
> - busiest->push_cpu = this_cpu;
> - active_balance = 1;
> - }
> + if (busiest->active_balance)
> + goto no_active_balance;
>
> + /*
> + * @busiest dropped its rq_lock in the middle of
> + * scheduling out its ->curr task (->on_rq := 0), no
> + * need to forcefully punt it away with active balance.
> + */
> + if ((busiest->curr->sched_class == &fair_sched_class) &&
> + !busiest->curr->on_rq)
> + goto no_active_balance;
> +
> + busiest->active_balance = 1;
> + busiest->push_cpu = this_cpu;
> + active_balance = 1;
> +no_active_balance:
> preempt_disable();
> raw_spin_rq_unlock_irqrestore(busiest, flags);
> if (active_balance) {
>
>

--