Re: [PATCH v2] sched/fair: Don't trigger active lb if src_rq->curr is CFS and not on_rq

From: Xin Zhao

Date: Mon Jun 15 2026 - 05:06:25 EST


On Mon, 15 Jun 2026 14:57:49 +0800 "Aiqun(Maria) Yu" <aiqun.yu@xxxxxxxxxxxxxxxx> wrote:

> > This indeed reminds me that I should move the checks for curr->sched_class
> > and curr->on_rq to a more appropriate place.
> >
> > The number of checks before executing active balancing will increase from
> > two to three with this patch:
> > 1. Do cpumask_test_cpu() for busiest->curr, ensure it can run on dst_cpu.
> > 2. Confirm that there are no already triggered active balances for the
> > busiest's run queue.
> > 3. Check busiest->curr; if busiest->curr is a CFS task, it's on_rq should
> > not be 0.
> > Testing has shown that condition 1 filters out approximately 91.4% of all
>
> Could you describe the scenario under which this data was collected?
> Without that context, the numbers don't give me a meaningful reference
> point.

The raw data from the test looks like this:
...
cpu6
domain0 003c0
IDLE i: 1881 5 8190 o: 82 27 27 26 0 isf: 23 22 16 8 0 isfl: 8 8 0 isft: 18050 18050 0 cmt: 8 0 8 0 cmtlc: 8/8 0 18050 0 isf_cannot_mt: 1 0 1 0 0 notf: 0 0 0 notft: 0 0
NOT_I i: 0 0 1671 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
NEWLY i: 147070 126 0 o: 10285 422 420 370 2 isf: 338 299 248 124 49 isfl: 124 124 39 isft: 217425 217425 47925 cmt: 171 0 124 47 cmtlc: 124/124 39 217425 46575 isf_cannot_mt: 39 0 39 0 1350 notf: 2 1 1 notft: 1850 325
domain1 3ffff
IDLE i: 4098 21 1693 o: 831 41 41 32 0 isf: 34 26 14 7 2 isfl: 7 7 2 isft: 17700 17700 2350 cmt: 9 0 7 2 cmtlc: 7/7 2 17700 2350 isf_cannot_mt: 8 0 8 0 0 notf: 0 0 0 notft: 0 0
NOT_I i: 1 0 5 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
NEWLY i: 194366 187 0 o: 68024 688 672 627 16 isf: 543 512 423 211 116 isfl: 212 211 98 isft: 445350 443250 129050 cmt: 327 0 212 115 cmtlc: 211/212 90 445350 127800 isf_cannot_mt: 31 0 31 0 1250 notf: 16 8 8 notft: 13000 8675
...
cpu14
domain0 3c000
IDLE i: 171 0 4421 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
NOT_I i: 0 0 824 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
NEWLY i: 9083 0 0 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
domain1 3ffff
IDLE i: 5859 3 1218 o: 199 59 53 52 6 isf: 51 51 98 49 0 isfl: 50 50 0 isft: 141300 141300 0 cmt: 49 0 49 0 cmtlc: 50/50 0 141300 0 isf_cannot_mt: 0 0 0 0 0 notf: 6 6 0 notft: 17225 0
NOT_I i: 0 0 2 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
NEWLY i: 354443 218 0 o: 19011 13417 8229 8177 5188 isf: 7952 7904 15166 7582 83 isfl: 7586 7584 79 isft: 23112800 23109750 85575 cmt: 7664 0 7583 81 cmtlc: 7584/7585 74 23111950 83425 isf_cannot_mt: 48 1 47 850 2150 notf: 5121 5098 23 notft: 16799950 25950

The output corresponds to the following code:
seq_printf(seq, " i: %u %u %u o: %u %u %u %u %u isf: %u %u %u %u %u isfl: %u %u %u isft: %lu %lu %lu cmt: %u %u %u %u cmtlc: %u/%u %u %lu %lu isf_cannot_mt: %u %u %u %lu %lu notf: %u %u %u notft: %lu %lu\n",
sd->balance_interval_min_count[itype], sd->balance_interval_min_fail_count[itype], sd->balance_interval_x2_count[itype],
sd->out_one_pinnned_count[itype], sd->test_cpu_ok_count[itype], sd->subcheck_isfair_count[itype], sd->subcheck_isfair_on_rq_count[itype], sd->subcheck_notfair_count[itype],
sd->alb_isfair_count[itype], sd->alb_isfair_on_rq_count[itype], sd->alb_isfair_succeed_count[itype], sd->alb_isfair_succeed_albtask_count[itype], sd->alb_isfair_fail_count[itype],
sd->alb_isfair_succeed_loop[itype], sd->alb_isfair_succeed_albtask_loop[itype], sd->alb_isfair_fail_loop[itype],
sd->alb_isfair_succeed_costns[itype], sd->alb_isfair_succeed_albtask_costns[itype], sd->alb_isfair_fail_costns[itype],
sd->can_move_tail_count[itype], sd->move_tail_count[itype], sd->can_move_tail_success_count[itype], sd->can_move_tail_fail_count[itype],
sd->can_move_tail_success_alb_task_loop[itype], sd->can_move_tail_success_loop[itype], sd->can_move_tail_fail_loop[itype], sd->can_move_tail_success_costns[itype], sd->can_move_tail_fail_costns[itype],
sd->isfair_cannot_move_tail_count[itype], sd->isfair_cannot_move_tail_success_count[itype], sd->isfair_cannot_move_tail_fail_count[itype], sd->isfair_cannot_move_tail_success_costns[itype], sd->isfair_cannot_move_tail_fail_costns[itype],
sd->alb_notfair_count[itype], sd->alb_notfair_suceed_count[itype], sd->alb_notfair_fail_count[itype],
sd->alb_notfair_succeed_costns[itype], sd->alb_notfair_fail_costns[itype]);

What we are interested in here are the statistics from the following code branch:

#ifdef CONFIG_SCHEDSTATS_DEBUG
	if (cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
		schedstat_debug_inc(sd->test_cpu_ok_count[idle]);
		if (busiest->curr->sched_class == &fair_sched_class) {
			schedstat_debug_inc(sd->subcheck_isfair_count[idle]);
			isfair = 1;
			if (busiest->curr->on_rq) {
				schedstat_debug_inc(sd->subcheck_isfair_on_rq_count[idle]);
			}
		} else {
			schedstat_debug_inc(sd->subcheck_notfair_count[idle]);
			isfair = 0;
		}
	} else {
		schedstat_debug_inc(sd->out_one_pinnned_count[idle]);
		raw_spin_rq_unlock_irqrestore(busiest, flags);
		goto out_one_pinned;
	}
#else
	if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
		raw_spin_rq_unlock_irqrestore(busiest, flags);
		goto out_one_pinned;
	}
#endif

Let's take 'o: 68024 688 672 627 16' as an example:
the first number after 'o:' is the count of 'goto out_one_pinned;'
the second number after 'o:' is the count of cases where the cpumask_test_cpu() test passes
the third number after 'o:' is the count of cases where the test passes and curr is in the fair class
the fourth number after 'o:' is the count of cases where the test passes, curr is in the fair class, and curr is on_rq
the fifth number after 'o:' is the count of cases where the test passes and curr is not in the fair class

91.4% = (sum of the first numbers) / (sum of the first numbers + sum of the second numbers)
2.36% = (sum of the fourth numbers) / (sum of the second numbers)
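
As a rough illustration of how the first ratio is formed, the sketch below sums
the first and second numbers after 'o:' over only the cpu6 and cpu14 rows quoted
above. The struct and function names are made up for this mail and are not part
of the patch. Because the other CPUs are elided here, the result will not match
the 91.4% computed from the full data set.

```c
#include <stddef.h>

/* Hypothetical holder for the first two numbers after 'o:' in one row. */
struct o_sample {
	unsigned long pinned;	/* count of 'goto out_one_pinned' */
	unsigned long test_ok;	/* count of cpumask_test_cpu() test ok */
};

/* Rows copied from the cpu6 and cpu14 excerpt; all other CPUs are elided. */
static const struct o_sample excerpt[] = {
	{ 82, 27 }, { 0, 0 }, { 10285, 422 },		/* cpu6 domain0 */
	{ 831, 41 }, { 0, 0 }, { 68024, 688 },		/* cpu6 domain1 */
	{ 0, 0 }, { 0, 0 }, { 0, 0 },			/* cpu14 domain0 */
	{ 199, 59 }, { 0, 0 }, { 19011, 13417 },	/* cpu14 domain1 */
};

/* Share of attempts rejected by the cpumask_test_cpu() check, in percent. */
static double cond1_filter_pct(const struct o_sample *s, size_t n)
{
	unsigned long pinned = 0, ok = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		pinned += s[i].pinned;
		ok += s[i].test_ok;
	}
	if (pinned + ok == 0)
		return 0.0;
	return 100.0 * (double)pinned / (double)(pinned + ok);
}
```

Calling cond1_filter_pct(excerpt, 12) on this excerpt gives roughly 87%. The
same summation over every CPU and domain is what produces the 91.4% figure.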

Take 'isf: 543 512 423 211 116' as an example; the corresponding debug code is shown further below:
the first number after 'isf:' is alb_isfair_count
the second number after 'isf:' is alb_isfair_on_rq_count

5.4% is derived approximately from the ratio of two sample values:
the second number after 'isf:' (alb_isfair_on_rq_count, which is only counted when the !busiest->active_balance check passes)
divided by
the fourth number after 'o:' (cpumask_test_cpu() test passes, fair class, and on_rq)

	if (!busiest->active_balance) {
		busiest->active_balance = 1;
		busiest->idle_type = idle;
		busiest->sd_alb = sd;
		busiest->can_move_tail = 0;
		busiest->move_tail = 0;
		if (isfair) {
			busiest->isfair = 1;
			schedstat_debug_inc(sd->alb_isfair_count[idle]);
			if (busiest->curr->on_rq) {
				busiest->can_move_tail = 1;
				schedstat_debug_inc(sd->alb_isfair_on_rq_count[idle]);
				if (use_list_move_tail) {
					list_move_tail(&busiest->curr->se.group_node, &busiest->cfs_tasks);
					busiest->move_tail = 1;
				}
			} else {
				//if (printk_ratelimit())
				// {
				// printk("zhaoxin_rq:trigger_ab:cpu[%d]busiest[%d]busiest->curr[%d][%s]busiest->curr->on_rq[%d]\n",
				// smp_processor_id(), busiest->cpu, busiest->curr->pid, busiest->curr->comm, busiest->curr->on_rq);
				// }
			}
			busiest->alb_task = busiest->curr;
		} else {
			busiest->isfair = 0;
			schedstat_debug_inc(sd->alb_notfair_count[idle]);
			busiest->alb_task = NULL;
		}
		busiest->push_cpu = this_cpu;
		active_balance = 1;
	}

I hope this makes the numbers clear.


> > cases, which is a sufficiently high filtering rate. Therefore, conditions
> > 2 and 3 should be evaluated based on this condition. If we consider the
> > samples filtered out by condition 1, condition 2 will filter out about
> > 5.4% of the cases, and condition 3 will filter out about 2.36% of the
>
> Shall we also have busiest->curr is not a CFS task as a separate condition?

> > of code: why there isn't a further check for fair_sched_class. My test
> > data shows that if the cpumask_test_cpu test is satisfied and
> > busiest->curr is not a CFS task, the success rate of active balancing
> > reaches as high as 98.7%. This result is clearly different from our
> > initial expectations.
>
> Since the busiest rq lock was newly hold, so it is potentially have new
> conditions like current task is not cfs task running and don't need to
> do the active load balance.
> So if the current busiest_rq->cfs_tasks is not empty, the light weight
> best effort balance is just detach from busiest_rq and attach to this rq.

> > I believe this could be attributed to a few reasons:
> > 1. The proportion of real-time tasks in the system is generally quite
> > small, so they are more likely to occupy busiest->curr for only a brief
> > period. The CFS tasks we want to migrate may easily be "buried" by this
> > recently executed real-time task.
> > 2. If this task's CPU can run on dst_cpu, it indicates that the real-time
> > task is correlated with dst_cpu. Real-time tasks often trigger new
> > associated CFS tasks, which increases the success rate of executing active
> > load balancing. I also mentioned this point in the commit log.
> >
>
> The current task is changed to other higher priority task, either it is
> rt or other non-cfs task, it is possible that the previous identified
> cfs task is not current running, and don't need active load balance at all.

You may believe that busiest->curr should be checked to see whether it is a
CFS task and that, if it is not, active balancing should not be performed.
However, as I mentioned, I have tested the case where the cpumask_test_cpu()
check passes but busiest->curr is not a CFS task, and found that the success
rate of active balancing in this scenario is 98.7%. This number is obtained
through the following calculation:

Take 'notf: 5121 5098 23' as an example. The corresponding statistics code is:
sd->alb_notfair_count[itype], sd->alb_notfair_suceed_count[itype], sd->alb_notfair_fail_count[itype],

The success rate is the sum of the second numbers divided by the sum of the
first numbers. alb_notfair_suceed_count is incremented by the following code
in active_load_balance_cpu_stop():

	if (busiest_rq->isfair) {
		...
	} else {
		if (p) {
			schedstat_debug_inc(busiest_rq->sd_alb->alb_notfair_suceed_count[busiest_rq->idle_type]);
			schedstat_debug_add(busiest_rq->sd_alb->alb_notfair_succeed_costns[busiest_rq->idle_type], span);
		} else {
			schedstat_debug_inc(busiest_rq->sd_alb->alb_notfair_fail_count[busiest_rq->idle_type]);
			schedstat_debug_add(busiest_rq->sd_alb->alb_notfair_fail_costns[busiest_rq->idle_type], span);
		}
	}
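
For anyone who wants to recompute the rate from the debug output, here is a
minimal user-space sketch that pulls the 'notf:' triple out of one line and
turns it into a percentage. The helper names are hypothetical and exist only
in this mail. It handles a single row, whereas the 98.7% above comes from the
full data set, so one row need not match it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the three numbers after 'notf:'. */
struct notf_sample {
	unsigned int total;	/* alb_notfair_count */
	unsigned int succeed;	/* alb_notfair_suceed_count */
	unsigned int fail;	/* alb_notfair_fail_count */
};

/* Extract the 'notf:' triple from one line of output; returns 0 on success. */
static int parse_notf(const char *line, struct notf_sample *out)
{
	/* "notf:" does not match inside "notft:", so strstr() finds the right field. */
	const char *p = strstr(line, "notf:");

	if (!p)
		return -1;
	if (sscanf(p, "notf: %u %u %u", &out->total, &out->succeed, &out->fail) != 3)
		return -1;
	return 0;
}

/* Success rate of active balancing for the non-CFS case, in percent. */
static double notf_success_pct(const struct notf_sample *s)
{
	if (s->total == 0)
		return 0.0;
	return 100.0 * (double)s->succeed / (double)s->total;
}
```

For the row 'notf: 5121 5098 23' this yields about 99.55%, i.e. 5098 successes
out of 5121 attempts.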

> > I believe this could be attributed to a few reasons:
> > 1. The proportion of real-time tasks in the system is generally quite
> > small, so they are more likely to occupy busiest->curr for only a brief
> > period. The CFS tasks we want to migrate may easily be "buried" by this
> > recently executed real-time task.

Consider the following:

T0 src_rq is checked: CFS task 'p' passes cpumask_test_cpu(), but it is on_cpu (on busiest), so it cannot be migrated
T1 the busiest rq is unlocked
T2 the busiest cpu runs a high-prio task that preempts CFS task 'p'
T3 src_rq is checked: busiest->curr is now not a CFS task

That is why I said task 'p' may be "buried" by high-prio tasks.
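
The T0..T3 sequence can be modelled in user space with the toy sketch below.
All names are invented for illustration and none of this is kernel code. It
only encodes my reading of the scenario: after T2, busiest->curr is no longer
a CFS task, while 'p' is still queued and no longer executing.

```c
#include <stdbool.h>

enum toy_class { TOY_FAIR, TOY_RT };

/* Hypothetical, heavily simplified stand-in for a task. */
struct toy_task {
	enum toy_class cls;
	bool on_rq;	/* still queued on the run queue */
	bool on_cpu;	/* currently executing */
};

/* Hypothetical stand-in for the busiest run queue. */
struct toy_rq {
	struct toy_task *curr;
};

/* T2: a higher-priority task preempts the current one, which stays queued. */
static void toy_preempt(struct toy_rq *rq, struct toy_task *hi)
{
	rq->curr->on_cpu = false;	/* the preempted task keeps on_rq set */
	hi->on_rq = true;
	hi->on_cpu = true;
	rq->curr = hi;
}

/* In this model a fair task can be pulled only while queued and not running. */
static bool toy_can_pull(const struct toy_task *p)
{
	return p->cls == TOY_FAIR && p->on_rq && !p->on_cpu;
}
```

In this model 'p' cannot be pulled at T0 because it is on_cpu. After
toy_preempt() at T2, rq->curr is the RT task, yet toy_can_pull(p) becomes
true, which is the "buried" case described above.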

Thanks
Xin Zhao