Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.

From: Vincent Guittot

Date: Tue Apr 28 2026 - 11:07:55 EST


On Tue, 28 Apr 2026 at 12:53, <imran.f.khan@xxxxxxxxxx> wrote:
>
> Hello Vincent,
> Thanks so much for clarifying my queries.
> On 24/4/2026 5:46 pm, Vincent Guittot wrote:
> > On Wed, 22 Apr 2026 at 18:13, <imran.f.khan@xxxxxxxxxx> wrote:
> >>
> >> Hello Vincent,
> >> Thanks a lot for taking a look into this.
> >> On 22/4/2026 3:54 pm, Vincent Guittot wrote:
> >>> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
> >>>>
> >>>> On large scale systems, for example with 768 CPUs and cpusets consisting
> >>>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> >>>> close to or same as now.
> >>>> This causes nohz.next_balance to be perpetually the same as current jiffies,
> >>>> thus causing the time-based check in nohz_balancer_kick() to always fail.
> >>>>
> >>>> For example putting dtrace probe at nohz_balancer_kick, on such a system,
> >>>> we can see that nohz.next_balance is at current jiffy at almost each tick:
> >>>>
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> >>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
> >>>>
> >>>> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
> >>>> to run almost every tick, and this in turn can consume a lot of CPU cycles in
> >>>> subsequent nohz idle balancing.
> >>>> So set nohz.next_balance based on the number of currently idle CPUs, such that
> >>>> for 32 idle CPUs nohz.next_balance is advanced further by 1 jiffy.
> >>>> This will cause nohz_balancer_kick() to bail out early.
> >>>>
> >>>> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
> >>>> ---
> >>>> kernel/sched/fair.c | 13 +++++++++++--
> >>>> 1 file changed, 11 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>>> index ab4114712be74..bd35275a05b38 100644
> >>>> --- a/kernel/sched/fair.c
> >>>> +++ b/kernel/sched/fair.c
> >>>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
> >>>> * Increase nohz.next_balance only when if full ilb is triggered but
> >>>> * not if we only update stats.
> >>>> */
> >>>> - if (flags & NOHZ_BALANCE_KICK)
> >>>> - nohz.next_balance = jiffies+1;
> >>>
> >>> This +1 only cheaply prevents multiple nohz_ilb from happening
> >>> simultaneously during the current jiffies.
> >>>
> >>> The actual update of nohz.next_balance is done in _nohz_idle_balance()
> >>> and reflects the next balance of all idle rqs. You should look at the
> >>> balance interval of your sched_domains. The min interval is the weight
> >>> of the sched_domain, which can be 2 at SMT level.
> >>>
> >>
> >> I did not look at the balance interval of the involved sched domain.
> >> IIUC once nohz.next_balance has been updated in _nohz_idle_balance(),
> >> we will see that updated value in nohz_balancer_kick() and if it's further
> >> from current jiffies, the time_before(now, nohz.next_balance) test would
> >> cause nohz_balancer_kick() to bail out without updating flags and that in
> >> turn would avoid the kick_ilb() path.
> >
> > yes
> >
> >> Since jiffies and nohz.next_balance were appearing close or same in
> >> nohz_balancer_kick() and I could see that CPU 2 was executing nohz_csd_func(),
> >> almost instantly and pretty much at frequency of each tick (dtrace snippet shown
> >> below), my conclusion was that one or more CPUs in sched domain of CPU 2 must
> >> have had their rq->next_balance close to or same as current jiffies.
> >
> > Yes
> >
> >>
> >> ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
> >> ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3
> >>
> >> Could you please let me know if this understanding is incorrect?
> >
> > yes, it is correct.
> >
> > The ILB is kicked for several reasons:
> > - NOHZ_BALANCE_KICK : periodic load balance based on the
> > balance_interval of each sched_domain
> > - NOHZ_STATS_KICK: update of statistics i.e. decaying the blocked load
> > - NOHZ_NEXT_KICK: loop on idle cpu to update nohz.next_balance when a
> > cpu becomes idle.
> >
> > NOHZ_NEXT_KICK and NOHZ_STATS_KICK can be set independently for
> > "cheap" idle load balance
> >
> > and NOHZ_STATS_KICK is set whenever NOHZ_BALANCE_KICK is set to take
> > advantage of the ILB to update the blocked load instead of kicking
> > another one just for updating the stats.
> >
> >
> >>
> >> Regarding the question of sched_domain topology, this host
> >> has 768 CPUs and almost all (except 6) have been divided
> >> between 2 cpusets (one for each node). For example for node0
> >> CPUs we have:
> >>
> >> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
> >> root
> >> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
> >> 2-191,386-575
> >>
> >> and their sched_domains look like, as shown below:
> >>
> >> cpu2:
> >> domain0: cpus=2,386
> >> domain1: cpus=2-15,386-399
> >> domain2: cpus=2-191,386-575
> >> cpu3:
> >> domain0: cpus=3,387
> >> domain1: cpus=2-15,386-399
> >> domain2: cpus=2-191,386-575
> >> cpu4:
> >> domain0: cpus=4,388
> >> domain1: cpus=2-15,386-399
> >> domain2: cpus=2-191,386-575
> >> .....
> >> .....
> >>
> >> Could you please suggest if updating rq->next_balance or
> >> final nohz.next_balance by some other logic can help reduce the
> >> CPU usage of _nohz_idle_balance(), or should we just ignore it
> >> because the CPU is idle anyway?
> >
> > With SMT domain, the idle load balance will be kicked every 2 ms for
> > each core domain. If the load balance of all cores is not aligned on
> > the same tick, you will have an ILB every tick if there are activities
> > on some CPUs and we need to check whether it can be pulled on an idle
> > CPU. But it should be light.
> >
> >>
> >> On these systems I can see that CPU 2 is doing most of this work.
> >> Running a perf top on CPU 2 gives numbers like:
> >>
> >> 21.69% [kernel] [k] __update_blocked_fair
> >> 11.40% [kernel] [k] update_load_avg
> >> 9.36% [kernel] [k] __update_load_avg_cfs_rq
> >> 8.07% [kernel] [k] update_rq_clock
> >> 7.09% [kernel] [k] __update_load_avg_se
> >> 4.67% [kernel] [k] update_irq_load_avg
> >>
> >> .....
> >> .....
> >> 22.26% [kernel] [k] __update_blocked_fair
> >> 10.89% [kernel] [k] update_load_avg
> >> 9.65% [kernel] [k] __update_load_avg_cfs_rq
> >> 7.80% [kernel] [k] update_rq_clock
> >> 7.23% [kernel] [k] __update_load_avg_se
> >> 4.76% [kernel] [k] update_sg_lb_stats
> >>
> >> and mpstat also shows softirq usage of around 20-25% on CPU 2 and
> >> most of that is due to SCHED_SOFTIRQ leading into
> >> _nohz_idle_balance.
> >
> > The time to update the blocked loads increases with the cgroup
> > hierarchy because we must walk the hierarchy.
> >
> > Does it generate problems for your system? As you mentioned above, if
> > CPU2 is idle, running such background activities should not cause
> > harm.
> >
>
> No, it's not causing any issues. Does this mean that the second patch of this
> set can be dropped as well? I could see that despite multiple CPUs being idle
> in this domain, it was CPU 2 that was doing nohz idle balance most of the time.

Yes, we can drop patch 2 as well. The fact that CPU2 handles most of
the nohz idle balance is not a problem by itself.

Vincent

>
> Thanks,
> Imran
>
> >>
> >> Thanks,
> >> Imran
> >>
> >> PS: I used the following dtrace snippets to get nohz_balancer_kick
> >> data shown earlier and nohz_csd_func() data shown in this message.
> >>
> >> dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'
> >>
> >>
> >>
> >> fbt::nohz_csd_func:entry
> >> {
> >>         this->rq = (struct rq *)arg0;
> >>         this->rq_cpu = this->rq->cpu;
> >>         this->rq_nohz_flags = this->rq->nohz_flags.counter;
> >>         this->ts_ms = (unsigned long)(walltimestamp / 1000000);
> >>         printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
> >>         /*printf("[%lu] IPI received on cpu=%d\n",
> >>         this->ts_ms, cpu);*/
> >>         /*@ipi_rate[cpu] = count();*/
> >> }
> >>
> >>> Which kind of sched_domain topology do you have?
> >>>
> >>>
> >>>> + if (flags & NOHZ_BALANCE_KICK) {
> >>>> + unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> >>>> +
> >>>> + /*
> >>>> + * On large systems, there may always be some idle CPU(s) with
> >>>> + * rq->next_balance close to or at current time, thus causing
> >>>> + * frequent invocation of kick_ilb() from nohz_balancer_kick().
> >>>> + * Adjust next_balance based on the number of idle CPUs.
> >>>> + */
> >>>> + nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
> >>>> + }
> >>>>
> >>>> ilb_cpu = find_new_ilb();
> >>>> if (ilb_cpu < 0)
> >>>> --
> >>>> 2.34.1
> >>>>
> >>
>
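
For reference, a minimal userspace sketch of the arithmetic in the quoted
hunk, and of how a later nohz.next_balance makes the
time_before(now, nohz.next_balance) test bail out. This is not kernel code.
The helper names (ilog2_sketch, extra_jiffies, time_before_sketch,
patched_next_balance) are hypothetical stand-ins. ilog2_sketch() assumes the
floor-of-log2 behaviour of the kernel's ilog2(), and time_before_sketch()
assumes the usual signed-difference form of time_before(). Note that with
the condition written as nr_idle > 32, exactly 32 idle CPUs add nothing.
The first extra jiffy appears at 33 idle CPUs, and each doubling after that
adds one more.

```c
#include <assert.h>
#include <stdbool.h>

/* Floor of log2(n), standing in for the kernel's ilog2(); n must be non-zero. */
static unsigned int ilog2_sketch(unsigned int n)
{
	unsigned int log = 0;

	assert(n != 0);
	while (n >>= 1)
		log++;
	return log;
}

/* Extra jiffies the proposed hunk adds on top of "jiffies + 1". */
static unsigned int extra_jiffies(unsigned int nr_idle)
{
	return (nr_idle > 32) ? ilog2_sketch(nr_idle) - 4 : 0;
}

/* Wraparound-safe "a is before b", assumed equivalent to time_before(a, b). */
static bool time_before_sketch(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/* Value the patched kick_ilb() would store in nohz.next_balance. */
static unsigned long patched_next_balance(unsigned long now, unsigned int nr_idle)
{
	return now + 1 + extra_jiffies(nr_idle);
}
```

With 380 idle CPUs, extra_jiffies() gives 4, so a kick at jiffy N would set
next_balance to N + 5 and the time check would bail out for jiffies N + 1
through N + 4, unless _nohz_idle_balance() rewrites nohz.next_balance with
an earlier value in between.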