Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.

From: Vincent Guittot

Date: Fri Apr 24 2026 - 05:46:23 EST

On Wed, 22 Apr 2026 at 18:13, <imran.f.khan@xxxxxxxxxx> wrote:
>
> Hello Vincent,
> Thanks a lot for taking a look into this.
> On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> > On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
> >>
> >> On large scale systems, for example with 768 CPUs and cpusets consisting
> >> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> >> close to or same as now.
> >> This causes nohz.next_balance to be perpetually same as current jiffies,
> >> thus causing the time-based check in nohz_balancer_kick() to always fail.
> >>
> >> For example putting dtrace probe at nohz_balancer_kick, on such a system,
> >> we can see that nohz.next_balance is at current jiffy at almost each tick:
> >>
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> >> 447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
> >>
> >> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> >> to run almost every tick and this in turn can consume a lot of CPU cycles in
> >> subsequent nohz idle balancing.
> >> So set nohz.next_balance based on number of currently idle CPUs, such that
> >> for 32 idle CPUs nohz.next_balance is advanced further by 1 jiffy.
> >> This will allow nohz_balancer_kick() to bail out early.
> >>
> >> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
> >> ---
> >> kernel/sched/fair.c | 13 +++++++++++--
> >> 1 file changed, 11 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index ab4114712be74..bd35275a05b38 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
> >> * Increase nohz.next_balance only when if full ilb is triggered but
> >> * not if we only update stats.
> >> */
> >> - if (flags & NOHZ_BALANCE_KICK)
> >> - nohz.next_balance = jiffies+1;
> >
> > This +1 only cheaply prevents multiple nohz_ilb from happening
> > simultaneously during the current jiffies.
> >
> > The actual update of nohz.next_balance is done in _nohz_idle_balance()
> > and reflects the next balance of all idle rqs. You should look at the
> > balance interval of your sched_domains. The min interval is the weight
> > of the sched_domain, which can be 2 at SMT level.
> >
>
> I did not look at the balance interval of the involved sched domain.
> IIUC once nohz.next_balance has been updated in _nohz_idle_balance(),
> we will see that updated value in nohz_balancer_kick() and if it's further
> from current jiffies, the time_before(now, nohz.next_balance) test would
> cause nohz_balancer_kick() to bail out without updating flags and that in
> turn would avoid kick_ilb() path.

yes
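
As a rough userspace illustration of that time check (not kernel code):
the macros below are simplified copies of the wrap-safe comparisons in
include/linux/jiffies.h without the type checking, kick_bails_out() and
trace_examples() are hypothetical helpers, and a 64-bit unsigned long is
assumed so that the jiffies values from your trace fit.

```c
#include <assert.h>
#include <limits.h>

/* Simplified userspace copies of the kernel's wrap-safe comparisons. */
#define time_after(a, b)	((long)((b) - (a)) < 0)
#define time_before(a, b)	time_after(b, a)

/*
 * Hypothetical helper: returns 1 when the time-based check would make
 * nohz_balancer_kick() bail out, i.e. now is still before next_balance.
 */
static int kick_bails_out(unsigned long now, unsigned long next_balance)
{
	return time_before(now, next_balance);
}

/* Two samples taken from the dtrace output quoted in this thread. */
static void trace_examples(void)
{
	/* jiffies=9764770869, nohz.next_balance=9764770870: bail out */
	assert(kick_bails_out(9764770869UL, 9764770870UL));
	/* jiffies == nohz.next_balance: the check does not bail out */
	assert(!kick_bails_out(9764770870UL, 9764770870UL));
}
```

With next_balance equal to jiffies on nearly every tick, as in your
trace, this check almost never bails out.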

> Since jiffies and nohz.next_balance were appearing close or same in
> nohz_balancer_kick() and I could see that CPU 2 was executing nohz_csd_func(),
> almost instantly and pretty much at frequency of each tick (dtrace snippet shown
> below), my conclusion was that one or more CPUs in sched domain of CPU 2 must
> have had their rq->next_balance close to or same as current jiffies.

Yes

>
> ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
> ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3
>
> Could you please let me know if this understanding is incorrect ?

Your understanding is correct.

The ILB is kicked for several reasons:
- NOHZ_BALANCE_KICK: periodic load balance based on the
balance_interval of each sched_domain
- NOHZ_STATS_KICK: update of statistics, i.e. decaying the blocked load
- NOHZ_NEXT_KICK: loop over the idle CPUs to update nohz.next_balance when a
CPU becomes idle

NOHZ_NEXT_KICK and NOHZ_STATS_KICK can be set independently for a
"cheap" idle load balance,

and NOHZ_STATS_KICK is set whenever NOHZ_BALANCE_KICK is set, to take
advantage of the ILB to update the blocked load instead of kicking
another one just for updating the stats.
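
To read the nohz_flags values in your trace against the list above, here
is a small userspace sketch. The bit positions are an assumption that
mirrors the definitions in kernel/sched/sched.h as I remember them, so
please check them against your tree; nohz_flags_decode() is a
hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumed bit layout (see kernel/sched/sched.h); verify for your kernel. */
#define NOHZ_BALANCE_KICK	(1U << 0)
#define NOHZ_STATS_KICK		(1U << 1)
#define NOHZ_NEWILB_KICK	(1U << 2)
#define NOHZ_NEXT_KICK		(1U << 3)

/*
 * Hypothetical helper: write the names of the set flags into buf,
 * separated by '|', and return how many names were written.
 */
static int nohz_flags_decode(unsigned int flags, char *buf, size_t len)
{
	static const struct {
		unsigned int bit;
		const char *name;
	} tbl[] = {
		{ NOHZ_BALANCE_KICK, "NOHZ_BALANCE_KICK" },
		{ NOHZ_STATS_KICK,   "NOHZ_STATS_KICK" },
		{ NOHZ_NEWILB_KICK,  "NOHZ_NEWILB_KICK" },
		{ NOHZ_NEXT_KICK,    "NOHZ_NEXT_KICK" },
	};
	size_t off = 0;
	int n = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(tbl) / sizeof(tbl[0]); i++) {
		int w;

		if (!(flags & tbl[i].bit))
			continue;
		w = snprintf(buf + off, len - off, "%s%s",
			     n ? "|" : "", tbl[i].name);
		if (w < 0 || (size_t)w >= len - off)
			break;	/* output truncated, stop here */
		off += (size_t)w;
		n++;
	}
	return n;
}
```

Under that assumed layout, nohz_flags = 3 decodes to
NOHZ_BALANCE_KICK|NOHZ_STATS_KICK, i.e. a full ILB that also updates the
blocked load.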

>
> Regarding the question of sched_domain topology, this host
> has 768 CPUs and almost all (except 6) have been divided
> between 2 cpusets (one for each node). For example for node0
> CPUs we have:
>
> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
> root
> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
> 2-191,386-575
>
> and their sched_domains look like, as shown below:
>
> cpu2:
> domain0: cpus=2,386
> domain1: cpus=2-15,386-399
> domain2: cpus=2-191,386-575
> cpu3:
> domain0: cpus=3,387
> domain1: cpus=2-15,386-399
> domain2: cpus=2-191,386-575
> cpu4:
> domain0: cpus=4,388
> domain1: cpus=2-15,386-399
> domain2: cpus=2-191,386-575
> .....
> .....
>
> Could you please suggest if updating rq->next_balance or
> final nohz.next_balance by some other logic can help reduce the
> CPU usage of _nohz_idle_balance or should we just ignore it
> because CPU is idle anyways.

With an SMT domain, the idle load balance will be kicked every 2 ms for
each core domain. If the load balance of all cores is not aligned on
the same tick, you will have an ILB every tick whenever there is activity
on some CPUs, because we need to check whether it can be pulled to an idle
CPU. But it should be light.
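
The alignment effect can be shown with a toy model. This is only a
sketch under simplifying assumptions: every core has a fixed balance
interval in jiffies, unaligned cores get a phase of (core index %
interval), and an ILB is "due" on a tick when any core's phase matches.
The real code compares jiffies against rq->next_balance and adjusts
intervals dynamically; ilb_due() and count_ilb_ticks() are hypothetical
names.

```c
#include <assert.h>

/*
 * Toy model: return 1 if at least one core wants a balance at tick now.
 * aligned != 0 puts all cores on phase 0, otherwise core i uses phase
 * i % interval.
 */
static int ilb_due(unsigned long now, unsigned int n_cores,
		   unsigned long interval, int aligned)
{
	assert(n_cores > 0 && interval > 0);

	for (unsigned int i = 0; i < n_cores; i++) {
		unsigned long phase = aligned ? 0 : i % interval;

		if (now % interval == phase)
			return 1;
	}
	return 0;
}

/* Count the ticks in [start, start + ticks) on which an ILB is due. */
static unsigned int count_ilb_ticks(unsigned long start, unsigned int ticks,
				    unsigned int n_cores,
				    unsigned long interval, int aligned)
{
	unsigned int due = 0;

	for (unsigned int t = 0; t < ticks; t++)
		due += ilb_due(start + t, n_cores, interval, aligned);
	return due;
}
```

In this model, 190 cores with an interval of 2 give an ILB on half of
the ticks when aligned and on every tick when not.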

>
> On these systems I can see that CPU 2 is doing most of this work.
> Running a perf top on CPU 2 gives numbers like:
>
> 21.69% [kernel] [k] __update_blocked_fair
> 11.40% [kernel] [k] update_load_avg
> 9.36% [kernel] [k] __update_load_avg_cfs_rq
> 8.07% [kernel] [k] update_rq_clock
> 7.09% [kernel] [k] __update_load_avg_se
> 4.67% [kernel] [k] update_irq_load_avg
>
> .....
> .....
> 22.26% [kernel] [k] __update_blocked_fair
> 10.89% [kernel] [k] update_load_avg
> 9.65% [kernel] [k] __update_load_avg_cfs_rq
> 7.80% [kernel] [k] update_rq_clock
> 7.23% [kernel] [k] __update_load_avg_se
> 4.76% [kernel] [k] update_sg_lb_stats
>
> and mpstat also shows softirq usage of around 20-25% on CPU 2 and
> most of that is due to SCHED_SOFTIRQ leading into
> _nohz_idle_balance.

The time to update the blocked loads increases with the depth of the
cgroup hierarchy because we must walk the hierarchy.

Does it generate problems for your system? As you mentioned above, if
CPU2 is idle, running such background activities should not cause
harm.

>
> Thanks,
> Imran
>
> PS: I used the following dtrace snippets to get nohz_balancer_kick
> data shown earlier and nohz_csd_func() data shown in this message.
>
> dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'
>
>
>
> fbt::nohz_csd_func:entry
> {
> this->rq = (struct rq *)arg0;
> this->rq_cpu = this->rq->cpu;
> this->rq_nohz_flags = this->rq->nohz_flags.counter;
> this->ts_ms = (unsigned long)(walltimestamp / 1000000);
> printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
> /*printf("[%lu] IPI received on cpu=%d\n",
> this->ts_ms, cpu);*/
> /*@ipi_rate[cpu] = count();*/
> }
>
> > Which kind of sched_domain topology do you have?
> >
> >
> >> + if (flags & NOHZ_BALANCE_KICK) {
> >> + unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> >> +
> >> + /*
> >> + * On large systems, there may always be some idle CPU(s) with
> >> + * rq->next_balance close to or at current time, thus causing
> >> + * frequent invocation of kick_ilb() from nohz_balancer_kick().
> >> + * Adjust next_balance based on the number of idle CPUs.
> >> + */
> >> + nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
> >> + }
> >>
> >> ilb_cpu = find_new_ilb();
> >> if (ilb_cpu < 0)
> >> --
> >> 2.34.1
> >>
>