Re: [PATCH v2] sched/fair: Don't trigger active lb if src_rq->curr is CFS and not on_rq

From: Xin Zhao

Date: Sun Jun 14 2026 - 08:05:32 EST

On Sun, 14 Jun 2026 12:17:37 +0800 "Aiqun(Maria) Yu" <aiqun.yu@xxxxxxxxxxxxxxxx> wrote:

> What's the probability that curr->on_rq is 1 throughout the entire check?
>
> >
> > In __schedule(), before setting curr to next, during the execution of
> > pick_next_task(), sched_balance_rq() is called. It will unlock and then
> > re-lock the rq, creating "holes" during which other CPUs may see zero
> > rq->curr->on_rq. This situation occurs quite frequently because:
> > 1. Periodic load balancing across CPUs often happens in close succession,
> > leading to collisions in the rq lock during sched_balance_rq().
> > 2. try_to_block_task() sets curr->on_rq to 0, and during the rq lock
> > "hole" in pick_next_task(), rq->curr has not yet been assigned to next,
> > resulting in curr->on_rq being seen as 0.
>
> It is possible that curr->on_rq is seen as 0 and don't need to do active
> balance.
> While my concern is the overhead of check "curr->on_rq" every time to
> the possibility of curr->on_rq should be considered as well.

This indeed reminds me that I should move the checks for curr->sched_class
and curr->on_rq to a more appropriate place.

The number of checks before executing active balancing will increase from
two to three with this patch:
1. Do cpumask_test_cpu() for busiest->curr, ensure it can run on dst_cpu.
2. Confirm that there are no already triggered active balances for the
busiest's run queue.
3. Check busiest->curr; if busiest->curr is a CFS task, it's on_rq should
not be 0.
Testing has shown that condition 1 filters out approximately 91.4% of all
cases, which is a sufficiently high filtering rate. Therefore, conditions
2 and 3 should be evaluated based on this condition. If we consider the
samples filtered out by condition 1, condition 2 will filter out about
5.4% of the cases, and condition 3 will filter out about 2.36% of the
cases.

Thus, it is better to place the filtering of condition 3 after the
filtering of condition 2. This change will be reflected in PATCH v3.
Additionally, this modification will not introduce new cases that skip the
logic for resetting balance_interval to min_interval.

> > Additionally, in sched_balance_rq(), we unconditionally reset the
> > balance_interval to min_interval. The difference is that original logic
> > does not reset the balance_interval when dst_cpu softirq handler is
> > preempted while src_cpu successfully run the just-dispatched active
> > balancing, during the gaps between two need_active_balance() checks. It
> > seems that we haven't observed any substantial benefits from reducing the
> > opportunities for balance under such fluctuating conditions. So simplify
>
> This is not that clear to me. And could you pls help to have more data
> and example to have more details?
> Maybe it can be separate with different patch?

The changes in this part do indeed seem subtle. I think you're right; they
should be separated into an independent patch with a detailed explanation.

> > - if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr)) {
> > + if (!cpumask_test_cpu(this_cpu, busiest->curr->cpus_ptr) ||
> > + (busiest->curr->sched_class == &fair_sched_class &&
> > + !busiest->curr->on_rq)) {
>
> if busiest->curr->sched_class != &fair_sched_class, do we really need to
> do active balance here?
>
> Also maybe unlikely(!busiest->curr->on_rq) instead.

This is also one of the points that puzzled me when I first saw this piece
of code: why there isn't a further check for fair_sched_class. My test
data shows that if the cpumask_test_cpu test is satisfied and
busiest->curr is not a CFS task, the success rate of active balancing
reaches as high as 98.7%. This result is clearly different from our
initial expectations.

I believe this could be attributed to a few reasons:
1. The proportion of real-time tasks in the system is generally quite
small, so they are more likely to occupy busiest->curr for only a brief
period. The CFS tasks we want to migrate may easily be "buried" by this
recently executed real-time task.
2. If this task's CPU can run on dst_cpu, it indicates that the real-time
task is correlated with dst_cpu. Real-time tasks often trigger new
associated CFS tasks, which increases the success rate of executing active
load balancing. I also mentioned this point in the commit log.

> >
> > - if (likely(!active_balance) || need_active_balance(&env)) {
>
> If the active_balance already triggered, why still need reset balancing
> interval?

As mentioned earlier, this part is not very obvious. I will separate this
section and provide a detailed explanation in PATCH v3.

Thanks
Xin Zhao