Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.

From: imran . f . khan

Date: Wed Apr 22 2026 - 12:18:32 EST


Hello Vincent,
Thanks a lot for taking a look into this.
On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>>
>> On large scale systems, for example with 768 CPUs and cpusets consisting
>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
>> close to or the same as now.
>> This causes nohz.next_balance to be perpetually the same as the current
>> jiffies, making the time-based check in nohz_balancer_kick() always fail.
>>
>> For example, with a dtrace probe at nohz_balancer_kick() on such a system,
>> we can see that nohz.next_balance is at the current jiffy at almost every tick:
>>
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
>> 447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>>
>> On such a system, setting nohz.next_balance to the next jiffy can cause
>> kick_ilb() to run almost every tick, and this in turn can consume a lot of
>> CPU cycles in the subsequent nohz idle balancing.
>> So set nohz.next_balance based on the number of currently idle CPUs, such
>> that each doubling of idle CPUs beyond 32 advances nohz.next_balance by one
>> more jiffy.
>> This lets nohz_balancer_kick() bail out early.
>>
>> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
>> ---
>> kernel/sched/fair.c | 13 +++++++++++--
>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index ab4114712be74..bd35275a05b38 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>> * Increase nohz.next_balance only when if full ilb is triggered but
>> * not if we only update stats.
>> */
>> - if (flags & NOHZ_BALANCE_KICK)
>> - nohz.next_balance = jiffies+1;
>
> This +1 only cheaply prevents multiple nohz_ilb from happening
> simultaneously during the current jiffies.
>
> The actual update of nohz.next_balance is done in _nohz_idle_balance()
> and reflects the next balance of all idle rqs. You should look at the
> balance interval of your sched_domains. The min interval is the weight
> of the sched_domain, which can be 2 at SMT level.
>

I did not look at the balance interval of the involved sched domain.
IIUC, once nohz.next_balance has been updated in _nohz_idle_balance(),
we will see that updated value in nohz_balancer_kick(), and if it is further
from the current jiffies, the time_before(now, nohz.next_balance) test will
cause nohz_balancer_kick() to bail out without updating flags, which in
turn avoids the kick_ilb() path.
Since jiffies and nohz.next_balance appeared close to or the same in
nohz_balancer_kick(), and I could see CPU 2 executing nohz_csd_func()
almost instantly and at roughly every tick (dtrace snippet shown
below), my conclusion was that one or more CPUs in the sched domain of
CPU 2 must have had their rq->next_balance close to or the same as the
current jiffies.

ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3

Could you please let me know if this understanding is incorrect?

Regarding the question of sched_domain topology: this host
has 768 CPUs, and almost all of them (all except 6) have been divided
between 2 cpusets (one for each node). For example, for node0
CPUs we have:

# cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
root
# cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
2-191,386-575

and their sched_domains look as shown below:

cpu2:
domain0: cpus=2,386
domain1: cpus=2-15,386-399
domain2: cpus=2-191,386-575
cpu3:
domain0: cpus=3,387
domain1: cpus=2-15,386-399
domain2: cpus=2-191,386-575
cpu4:
domain0: cpus=4,388
domain1: cpus=2-15,386-399
domain2: cpus=2-191,386-575
.....
.....

Could you please suggest whether updating rq->next_balance or the
final nohz.next_balance with some other logic could help reduce the
CPU usage of _nohz_idle_balance(), or should we just ignore it
because the CPU is idle anyway?

On these systems I can see that CPU 2 is doing most of this work.
Running perf top on CPU 2 gives numbers like:

21.69% [kernel] [k] __update_blocked_fair
11.40% [kernel] [k] update_load_avg
9.36% [kernel] [k] __update_load_avg_cfs_rq
8.07% [kernel] [k] update_rq_clock
7.09% [kernel] [k] __update_load_avg_se
4.67% [kernel] [k] update_irq_load_avg

.....
.....
22.26% [kernel] [k] __update_blocked_fair
10.89% [kernel] [k] update_load_avg
9.65% [kernel] [k] __update_load_avg_cfs_rq
7.80% [kernel] [k] update_rq_clock
7.23% [kernel] [k] __update_load_avg_se
4.76% [kernel] [k] update_sg_lb_stats

mpstat also shows softirq usage of around 20-25% on CPU 2, most of it
due to SCHED_SOFTIRQ leading into _nohz_idle_balance().

Thanks,
Imran

PS: I used the following dtrace snippets to get nohz_balancer_kick
data shown earlier and nohz_csd_func() data shown in this message.

dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'



fbt::nohz_csd_func:entry
{
	this->rq = (struct rq *)arg0;
	this->rq_cpu = this->rq->cpu;
	this->rq_nohz_flags = this->rq->nohz_flags.counter;
	this->ts_ms = (unsigned long)(walltimestamp / 1000000);
	printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d\n",
	    this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
}

> Which kind of sched_domain topology do you have?
>
>
>> + if (flags & NOHZ_BALANCE_KICK) {
>> + unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
>> +
>> + /*
>> + * On large systems, there may always be some idle CPU(s) with
>> + * rq->next_balance close to or at current time, thus causing
>> + * frequent invocation of kick_ilb() from nohz_balancer_kick().
>> + * Adjust next_balance based on the number of idle CPUs.
>> + */
>> + nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
>> + }
>>
>> ilb_cpu = find_new_ilb();
>> if (ilb_cpu < 0)
>> --
>> 2.34.1
>>