Re: [PATCH v2] sched/fair: Don't trigger active lb if src_rq->curr is CFS and not on_rq

From: Aiqun(Maria) Yu

Date: Mon Jun 15 2026 - 03:01:59 EST

On 6/14/2026 8:03 PM, Xin Zhao wrote:
> On Sun, 14 Jun 2026 12:17:37 +0800 "Aiqun(Maria) Yu" <aiqun.yu@xxxxxxxxxxxxxxxx> wrote:
>
>> What's the probability that curr->on_rq is 1 throughout the entire check?
>>
>>>
>>> In __schedule(), before setting curr to next, during the execution of
>>> pick_next_task(), sched_balance_rq() is called. It will unlock and then
>>> re-lock the rq, creating "holes" during which other CPUs may see zero
>>> rq->curr->on_rq. This situation occurs quite frequently because:
>>> 1. Periodic load balancing across CPUs often happens in close succession,
>>> leading to collisions in the rq lock during sched_balance_rq().
>>> 2. try_to_block_task() sets curr->on_rq to 0, and during the rq lock
>>> "hole" in pick_next_task(), rq->curr has not yet been assigned to next,
>>> resulting in curr->on_rq being seen as 0.
>>
>> It is possible that curr->on_rq is seen as 0 and don't need to do active
>> balance.
>> While my concern is the overhead of check "curr->on_rq" every time to
>> the possibility of curr->on_rq should be considered as well.
>
> This indeed reminds me that I should move the checks for curr->sched_class
> and curr->on_rq to a more appropriate place.
>
> The number of checks before executing active balancing will increase from
> two to three with this patch:
> 1. Do cpumask_test_cpu() for busiest->curr, ensure it can run on dst_cpu.
> 2. Confirm that there are no already triggered active balances for the
> busiest's run queue.
> 3. Check busiest->curr; if busiest->curr is a CFS task, it's on_rq should
> not be 0.
> Testing has shown that condition 1 filters out approximately 91.4% of all

Could you describe the scenario under which this data was collected?
Without that context, the numbers don't give me a meaningful reference
point.

> cases, which is a sufficiently high filtering rate. Therefore, conditions
> 2 and 3 should be evaluated based on this condition. If we consider the
> samples filtered out by condition 1, condition 2 will filter out about
> 5.4% of the cases, and condition 3 will filter out about 2.36% of the

Shall we also have busiest->curr is not a CFS task as a separate condition?

> cases.
>
> Thus, it is better to place the filtering of condition 3 after the
> filtering of condition 2. This change will be reflected in PATCH v3.
> Additionally, this modification will not introduce new cases that skip the
> logic for resetting balance_interval to min_interval.
>
>
>>> Additionally, in sched_balance_rq(), we unconditionally reset the
>>> balance_interval to min_interval. The difference is that original logic
>>> does not reset the balance_interval when dst_cpu softirq handler is
>>> preempted while src_cpu successfully run the just-dispatched active
>>> balancing, during the gaps between two need_active_balance() checks. It
>>> seems that we haven't observed any substantial benefits from reducing the
>>> opportunities for balance under such fluctuating conditions. So simplify
>>
>> This is not that clear to me. And could you pls help to have more data
>> and example to have more details?
>> Maybe it can be separate with different patch?
>
> The changes in this part do indeed seem subtle. I think you're right; they
> should be separated into an independent patch with a detailed explanation.
>
>
>>> - if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
>>> + if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr) ||
>>> + (busiest->curr->sched_class == &fair_sched_class &&
>>> + !busiest->curr->on_rq)) {
>>
>> if busiest->curr->sched_class != &fair_sched_class, do we really need to
>> do active balance here?
>>
>> Also maybe unlikely(!busiest->curr->on_rq) instead.
>
> This is also one of the points that puzzled me when I first saw this piece
> of code: why there isn't a further check for fair_sched_class. My test
> data shows that if the cpumask_test_cpu test is satisfied and
> busiest->curr is not a CFS task, the success rate of active balancing
> reaches as high as 98.7%. This result is clearly different from our
> initial expectations.

Since the busiest rq lock was newly hold, so it is potentially have new
conditions like current task is not cfs task running and don't need to
do the active load balance.
So if the current busiest_rq->cfs_tasks is not empty, the light weight
best effort balance is just detach from busiest_rq and attach to this rq.

>
> I believe this could be attributed to a few reasons:
> 1. The proportion of real-time tasks in the system is generally quite
> small, so they are more likely to occupy busiest->curr for only a brief
> period. The CFS tasks we want to migrate may easily be "buried" by this
> recently executed real-time task.
> 2. If this task's CPU can run on dst_cpu, it indicates that the real-time
> task is correlated with dst_cpu. Real-time tasks often trigger new
> associated CFS tasks, which increases the success rate of executing active
> load balancing. I also mentioned this point in the commit log.
>

The current task is changed to other higher priority task, either it is
rt or other non-cfs task, it is possible that the previous identified
cfs task is not current running, and don't need active load balance at all.

>
>>>
>>> - if (likely(!active_balance) || need_active_balance(&env)) {
>>
>> If the active_balance already triggered, why still need reset balancing
>> interval?
>
> As mentioned earlier, this part is not very obvious. I will separate this
> section and provide a detailed explanation in PATCH v3.

great thx.

>
>
> Thanks
> Xin Zhao
>

--
Thx and BRs,
Aiqun(Maria) Yu