Re: [PATCH v2] sched/fair: Don't trigger active lb if src_rq->curr is CFS and not on_rq
From: Aiqun(Maria) Yu
Date: Mon Jun 15 2026 - 08:31:19 EST
On 6/15/2026 5:01 PM, Xin Zhao wrote:
> On Mon, 15 Jun 2026 14:57:49 +0800 "Aiqun(Maria) Yu" <aiqun.yu@xxxxxxxxxxxxxxxx> wrote:
>
>>> This indeed reminds me that I should move the checks for curr->sched_class
>>> and curr->on_rq to a more appropriate place.
>>>
>>> The number of checks before executing active balancing will increase from
>>> two to three with this patch:
>>> 1. Do cpumask_test_cpu() for busiest->curr, ensure it can run on dst_cpu.
>>> 2. Confirm that there are no already triggered active balances for the
>>> busiest's run queue.
>>> 3. Check busiest->curr; if busiest->curr is a CFS task, its on_rq should
>>> not be 0.
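For reference, the three checks listed above can be modeled in user space roughly as follows. This is only a sketch under stated assumptions, not the patch itself: the toy_* types and toy_should_active_balance() are made up for illustration and are not kernel APIs, the affinity mask is a plain unsigned long instead of a cpumask, and the ordering simply follows the numbered list.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the scheduler structures (not kernel types). */
enum toy_class { TOY_FAIR, TOY_OTHER };

struct toy_task {
	unsigned long cpus_mask;	/* bit N set: task may run on CPU N */
	enum toy_class cls;		/* scheduling class of the task */
	bool on_rq;			/* task is still queued on its rq */
};

struct toy_rq {
	const struct toy_task *curr;	/* task currently running on this rq */
	bool active_balance;		/* an active balance is already pending */
};

/*
 * Return true only if all three checks pass.
 * dst_cpu must be smaller than the number of bits in unsigned long.
 */
static bool toy_should_active_balance(const struct toy_rq *busiest, int dst_cpu)
{
	const struct toy_task *curr = busiest->curr;

	/* 1. curr must be allowed to run on dst_cpu. */
	if (!(curr->cpus_mask & (1UL << dst_cpu)))
		return false;

	/* 2. No active balance may already be triggered for this rq. */
	if (busiest->active_balance)
		return false;

	/* 3. A CFS curr must still be on_rq; non-CFS curr is not filtered. */
	if (curr->cls == TOY_FAIR && !curr->on_rq)
		return false;

	return true;
}
```

In this model a CFS curr with on_rq == 0 is rejected by check 3, while a non-CFS curr that passes checks 1 and 2 still triggers active balancing.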
>>> Testing has shown that condition 1 filters out approximately 91.4% of all
>>
>> Could you describe the scenario under which this data was collected?
>> Without that context, the numbers don't give me a meaningful reference
>> point.
>
> The raw test data is as follows:
> ...
> cpu6
> domain0 003c0
> IDLE i: 1881 5 8190 o: 82 27 27 26 0 isf: 23 22 16 8 0 isfl: 8 8 0 isft: 18050 18050 0 cmt: 8 0 8 0 cmtlc: 8/8 0 18050 0 isf_cannot_mt: 1 0 1 0 0 notf: 0 0 0 notft: 0 0
> NOT_I i: 0 0 1671 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
> NEWLY i: 147070 126 0 o: 10285 422 420 370 2 isf: 338 299 248 124 49 isfl: 124 124 39 isft: 217425 217425 47925 cmt: 171 0 124 47 cmtlc: 124/124 39 217425 46575 isf_cannot_mt: 39 0 39 0 1350 notf: 2 1 1 notft: 1850 325
> domain1 3ffff
> IDLE i: 4098 21 1693 o: 831 41 41 32 0 isf: 34 26 14 7 2 isfl: 7 7 2 isft: 17700 17700 2350 cmt: 9 0 7 2 cmtlc: 7/7 2 17700 2350 isf_cannot_mt: 8 0 8 0 0 notf: 0 0 0 notft: 0 0
> NOT_I i: 1 0 5 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
> NEWLY i: 194366 187 0 o: 68024 688 672 627 16 isf: 543 512 423 211 116 isfl: 212 211 98 isft: 445350 443250 129050 cmt: 327 0 212 115 cmtlc: 211/212 90 445350 127800 isf_cannot_mt: 31 0 31 0 1250 notf: 16 8 8 notft: 13000 8675
> ...
> cpu14
> domain0 3c000
> IDLE i: 171 0 4421 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
> NOT_I i: 0 0 824 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
> NEWLY i: 9083 0 0 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
> domain1 3ffff
> IDLE i: 5859 3 1218 o: 199 59 53 52 6 isf: 51 51 98 49 0 isfl: 50 50 0 isft: 141300 141300 0 cmt: 49 0 49 0 cmtlc: 50/50 0 141300 0 isf_cannot_mt: 0 0 0 0 0 notf: 6 6 0 notft: 17225 0
> NOT_I i: 0 0 2 o: 0 0 0 0 0 isf: 0 0 0 0 0 isfl: 0 0 0 isft: 0 0 0 cmt: 0 0 0 0 cmtlc: 0/0 0 0 0 isf_cannot_mt: 0 0 0 0 0 notf: 0 0 0 notft: 0 0
> NEWLY i: 354443 218 0 o: 19011 13417 8229 8177 5188 isf: 7952 7904 15166 7582 83 isfl: 7586 7584 79 isft: 23112800 23109750 85575 cmt: 7664 0 7583 81 cmtlc: 7584/7585 74 23111950 83425 isf_cannot_mt: 48 1 47 850 2150 notf: 5121 5098 23 notft: 16799950 25950
>
> The output data corresponds to the following code:
> seq_printf(seq, " i: %u %u %u o: %u %u %u %u %u isf: %u %u %u %u %u isfl: %u %u %u isft: %lu %lu %lu cmt: %u %u %u %u cmtlc: %u/%u %u %lu %lu isf_cannot_mt: %u %u %u %lu %lu notf: %u %u %u notft: %lu %lu\n",
> sd->balance_interval_min_count[itype], sd->balance_interval_min_fail_count[itype], sd->balance_interval_x2_count[itype],
> sd->out_one_pinnned_count[itype], sd->test_cpu_ok_count[itype], sd->subcheck_isfair_count[itype], sd->subcheck_isfair_on_rq_count[itype], sd->subcheck_notfair_count[itype],
> sd->alb_isfair_count[itype], sd->alb_isfair_on_rq_count[itype], sd->alb_isfair_succeed_count[itype], sd->alb_isfair_succeed_albtask_count[itype], sd->alb_isfair_fail_count[itype],
> sd->alb_isfair_succeed_loop[itype], sd->alb_isfair_succeed_albtask_loop[itype], sd->alb_isfair_fail_loop[itype],
> sd->alb_isfair_succeed_costns[itype], sd->alb_isfair_succeed_albtask_costns[itype], sd->alb_isfair_fail_costns[itype],
> sd->can_move_tail_count[itype], sd->move_tail_count[itype], sd->can_move_tail_success_count[itype], sd->can_move_tail_fail_count[itype],
> sd->can_move_tail_success_alb_task_loop[itype], sd->can_move_tail_success_loop[itype], sd->can_move_tail_fail_loop[itype], sd->can_move_tail_success_costns[itype], sd->can_move_tail_fail_costns[itype],
> sd->isfair_cannot_move_tail_count[itype], sd->isfair_cannot_move_tail_success_count[itype], sd->isfair_cannot_move_tail_fail_count[itype], sd->isfair_cannot_move_tail_success_costns[itype], sd->isfair_cannot_move_tail_fail_costns[itype],
> sd->alb_notfair_count[itype], sd->alb_notfair_suceed_count[itype], sd->alb_notfair_fail_count[itype],
> sd->alb_notfair_succeed_costns[itype], sd->alb_notfair_fail_costns[itype]);
>
> What we are interested in now is the statistics from the following code branch:
>
> #ifdef CONFIG_SCHEDSTATS_DEBUG
> 	if (cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
> 		schedstat_debug_inc(sd->test_cpu_ok_count[idle]);
> 		if (busiest->curr->sched_class == &fair_sched_class) {
> 			schedstat_debug_inc(sd->subcheck_isfair_count[idle]);
> 			isfair = 1;
> 			if (busiest->curr->on_rq) {
> 				schedstat_debug_inc(sd->subcheck_isfair_on_rq_count[idle]);
> 			}
> 		} else {
> 			schedstat_debug_inc(sd->subcheck_notfair_count[idle]);
> 			isfair = 0;
> 		}
> 	} else {
> 		schedstat_debug_inc(sd->out_one_pinnned_count[idle]);
> 		raw_spin_rq_unlock_irqrestore(busiest, flags);
> 		goto out_one_pinned;
> 	}
> #else
> 	if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
> 		raw_spin_rq_unlock_irqrestore(busiest, flags);
> 		goto out_one_pinned;
> 	}
> #endif
>
> Let's take 'o: 68024 688 672 627 16' as an example:
> the first number after 'o:' is the count of 'goto out_one_pinned;'
This depends heavily on the scenario, for example whether tasks are
specifically affined.
What about scenarios where the tasks are not affined?
When I talk about the scenario, I would like to see how many tasks there are,
what the affinity settings are, etc. Or is it a reproducible benchmark that I
can replicate on my end as well?
> the second number after 'o:' is the count of 'cpumask_test_cpu' test ok
> the third number after 'o:' is the count of 'cpumask_test_cpu' test ok and is fair class
> the fourth number after 'o:' is the count of 'cpumask_test_cpu' test ok and is fair class and is on_rq
>
> 91.4% is the sum of the first numbers / (the sum of the first numbers plus the sum of the second numbers)
> 2.36% is the sum of the fourth numbers / the sum of the second numbers
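To make the first formula concrete, here is a small user-space sketch of the arithmetic as I read it: sum the first and second 'o:' numbers over all samples, then divide. The struct and function names are hypothetical, and since the rows quoted above are only an excerpt (note the "..."), feeding just those rows in should not be expected to reproduce the 91.4% figure.

```c
#include <stddef.h>

/* One 'o:' sample; only the first two counters matter for this ratio. */
struct o_sample {
	unsigned long pinned;	/* first number: 'goto out_one_pinned' taken */
	unsigned long test_ok;	/* second number: cpumask_test_cpu() passed */
};

/*
 * Percentage of samples filtered out by the cpumask test:
 *   100 * sum(first) / (sum(first) + sum(second))
 * Returns 0.0 when there are no samples at all.
 */
static double pinned_filter_pct(const struct o_sample *s, size_t n)
{
	unsigned long pinned = 0, ok = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		pinned += s[i].pinned;
		ok += s[i].test_ok;
	}

	if (pinned + ok == 0)
		return 0.0;

	return 100.0 * (double)pinned / (double)(pinned + ok);
}
```

Each (cpu, domain, idle type) row would contribute one o_sample entry.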
>
> Take 'isf: 543 512 423 211 116' as an example; the corresponding debug code is below:
> the first number after 'isf' is alb_isfair_count
> the second number after 'isf' is alb_isfair_on_rq_count
>
> 5.4% is the approximate ratio between sample values such as
> the second number after 'isf' (alb_isfair_on_rq_count, which includes the !busiest->active_balance check)
> divided by
> the fourth number after 'o:' (the count of 'cpumask_test_cpu' test ok and is fair class and is on_rq)
>
> 	if (!busiest->active_balance) {
> 		busiest->active_balance = 1;
> 		busiest->idle_type = idle;
> 		busiest->sd_alb = sd;
> 		busiest->can_move_tail = 0;
> 		busiest->move_tail = 0;
> 		if (isfair) {
> 			busiest->isfair = 1;
> 			schedstat_debug_inc(sd->alb_isfair_count[idle]);
> 			if (busiest->curr->on_rq) {
> 				busiest->can_move_tail = 1;
> 				schedstat_debug_inc(sd->alb_isfair_on_rq_count[idle]);
> 				if (use_list_move_tail) {
> 					list_move_tail(&busiest->curr->se.group_node, &busiest->cfs_tasks);
> 					busiest->move_tail = 1;
> 				}
> 			} else {
> 				//if (printk_ratelimit())
> 				// {
> 				//	printk("zhaoxin_rq:trigger_ab:cpu[%d]busiest[%d]busiest->curr[%d][%s]busiest->curr->on_rq[%d]\n",
> 				//		smp_processor_id(), busiest->cpu, busiest->curr->pid, busiest->curr->comm, busiest->curr->on_rq);
> 				// }
> 			}
> 			busiest->alb_task = busiest->curr;
> 		} else {
> 			busiest->isfair = 0;
> 			schedstat_debug_inc(sd->alb_notfair_count[idle]);
> 			busiest->alb_task = NULL;
> 		}
> 		busiest->push_cpu = this_cpu;
> 		active_balance = 1;
> 	}
>
> I hope this makes it clear.
>
>
>>> cases, which is a sufficiently high filtering rate. Therefore, conditions
>>> 2 and 3 should be evaluated based on this condition. If we consider the
>>> samples filtered out by condition 1, condition 2 will filter out about
>>> 5.4% of the cases, and condition 3 will filter out about 2.36% of the
>>
>> Shall we also have busiest->curr is not a CFS task as a separate condition?
>
>>> of code: why there isn't a further check for fair_sched_class. My test
>>> data shows that if the cpumask_test_cpu test is satisfied and
>>> busiest->curr is not a CFS task, the success rate of active balancing
>>> reaches as high as 98.7%. This result is clearly different from our
>>> initial expectations.
>>
>> Since the busiest rq lock has just been acquired, conditions may have
>> changed: for example, the running task may no longer be a CFS task, in
>> which case active load balancing is not needed.
>> So if busiest_rq->cfs_tasks is not empty, the lightweight best-effort
>> balance is simply to detach a task from busiest_rq and attach it to this rq.
>
>>> I believe this could be attributed to a few reasons:
>>> 1. The proportion of real-time tasks in the system is generally quite
>>> small, so they are more likely to occupy busiest->curr for only a brief
>>> period. The CFS tasks we want to migrate may easily be "buried" by this
>>> recently executed real-time task.
>>> 2. If this task's CPU can run on dst_cpu, it indicates that the real-time
>>> task is correlated with dst_cpu. Real-time tasks often trigger new
>>> associated CFS tasks, which increases the success rate of executing active
>>> load balancing. I also mentioned this point in the commit log.
>>>
>>
>> If the current task has changed to another higher-priority task, whether RT
>> or some other non-CFS task, the previously identified CFS task may no longer
>> be running, so active load balancing is not needed at all.
>
> Maybe you believe the busiest->curr should be checked to see if it is a CFS
> task; if it is not, then active balancing should not be performed. However,
> as I mentioned, I have tested the case where the cpumask_test_cpu passes but
> busiest->curr is not a CFS task. I found that the success rate of active
> balancing in this scenario is 98.7%. This number is obtained through the
Maybe if it just does detach_task and attach_task, the success rate would be
100%? My understanding is that if the curr task is not a CFS task, active load
balancing is not necessary at all.
> following calculation:
>
> Take 'notf: 5121 5098 23' as an example; the corresponding statistics code is:
> sd->alb_notfair_count[itype], sd->alb_notfair_suceed_count[itype], sd->alb_notfair_fail_count[itype],
>
> alb_notfair_suceed_count is incremented by the following code in active_load_balance_cpu_stop():
>
> 	if (busiest_rq->isfair) {
> 		...
> 	} else {
> 		if (p) {
> 			schedstat_debug_inc(busiest_rq->sd_alb->alb_notfair_suceed_count[busiest_rq->idle_type]);
> 			schedstat_debug_add(busiest_rq->sd_alb->alb_notfair_succeed_costns[busiest_rq->idle_type], span);
> 		} else {
> 			schedstat_debug_inc(busiest_rq->sd_alb->alb_notfair_fail_count[busiest_rq->idle_type]);
> 			schedstat_debug_add(busiest_rq->sd_alb->alb_notfair_fail_costns[busiest_rq->idle_type], span);
> 		}
> 	}
>
>>> I believe this could be attributed to a few reasons:
>>> 1. The proportion of real-time tasks in the system is generally quite
>>> small, so they are more likely to occupy busiest->curr for only a brief
>>> period. The CFS tasks we want to migrate may easily be "buried" by this
>>> recently executed real-time task.
>
> Consider the following:
>
> T0 src_rq checks CFS task 'p' (cpumask_test_cpu is ok), but it is on_cpu (busiest), so it cannot be migrated
> T1 the busiest rq is unlocked
> T2 the busiest CPU runs a high-prio task that preempts CFS task 'p'
> T3 src_rq checks busiest->curr, and it is not CFS
>
> That is why I said task 'p' may be "buried" by high-prio tasks.
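The T0..T3 sequence can be replayed in a tiny user-space model, shown here only to make the state changes explicit. Everything in it is an assumption for illustration: the toy_* types and helpers are not kernel code, and the only scheduler behavior it encodes is that a task currently executing on a CPU cannot be pulled by the regular detach path, while a preempted task stays queued.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel types. */
enum toy_class { TOY_FAIR, TOY_RT };

struct toy_task {
	enum toy_class cls;
	bool on_rq;	/* queued on the runqueue */
	bool on_cpu;	/* currently executing on the CPU */
};

struct toy_rq {
	struct toy_task *curr;	/* task currently running on this rq */
};

/* A running task cannot be pulled by the regular (non-active) path. */
static bool toy_can_detach(const struct toy_task *p)
{
	return p->on_rq && !p->on_cpu;
}

/* T2: 'next' preempts the running task; the old task stays queued. */
static void toy_preempt(struct toy_rq *rq, struct toy_task *next)
{
	rq->curr->on_cpu = false;
	next->on_rq = true;
	next->on_cpu = true;
	rq->curr = next;
}

/* T3: what the balancer sees when it looks at busiest->curr. */
static bool toy_curr_is_fair(const struct toy_rq *rq)
{
	return rq->curr->cls == TOY_FAIR;
}
```

At T0, 'p' is curr and toy_can_detach(p) is false. After toy_preempt() at T2, toy_curr_is_fair() is false even though 'p' is still queued behind the high-prio task and toy_can_detach(p) is now true, which is the "buried" situation described above.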
>
> Thanks
> Xin Zhao
>
--
Thx and BRs,
Aiqun(Maria) Yu