Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.

From: imran . f . khan

Date: Tue Apr 28 2026 - 07:14:03 EST

Hello Vincent,
Thanks so much for clarifying my queries.
On 24/4/2026 5:46 pm, Vincent Guittot wrote:
> On Wed, 22 Apr 2026 at 18:13, <imran.f.khan@xxxxxxxxxx> wrote:
>>
>> Hello Vincent,
>> Thanks a lot for taking a look into this.
>> On 22/4/2026 3:54 pm, Vincent Guittot wrote:
>>> On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>>>>
>>>> On large scale systems, for example with 768 CPUs and cpusets consisting
>>>> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
>>>> close to or the same as now.
>>>> This causes nohz.next_balance to be perpetually the same as current jiffies,
>>>> thus causing the time based check in nohz_balancer_kick() to always fail.
>>>>
>>>> For example, putting a dtrace probe at nohz_balancer_kick() on such a system,
>>>> we can see that nohz.next_balance is at the current jiffy at almost every tick:
>>>>
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
>>>> 447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>>>>
>>>> On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
>>>> to run almost every tick, and this in turn can consume a lot of CPU cycles in
>>>> subsequent nohz idle balancing.
>>>> So set nohz.next_balance based on the number of currently idle CPUs, such that
>>>> with more than 32 idle CPUs nohz.next_balance is advanced by at least 1 extra jiffy.
>>>> This will allow nohz_balancer_kick() to bail out early.
>>>>
>>>> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
>>>> ---
>>>> kernel/sched/fair.c | 13 +++++++++++--
>>>> 1 file changed, 11 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>>> index ab4114712be74..bd35275a05b38 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
>>>> * Increase nohz.next_balance only when if full ilb is triggered but
>>>> * not if we only update stats.
>>>> */
>>>> - if (flags & NOHZ_BALANCE_KICK)
>>>> - nohz.next_balance = jiffies+1;
>>>
>>> This +1 only cheaply prevents multiple nohz_ilb from happening
>>> simultaneously during the current jiffies.
>>>
>>> The actual update of nohz.next_balance is done in _nohz_idle_balance()
>>> and reflects the next balance of all idle rqs. You should look at the
>>> balance interval of your sched_domains. The min interval is the weight
>>> of the sched_domain, which can be 2 at SMT level.
>>>
>>
>> I did not look at the balance interval of the involved sched domain.
>> IIUC once nohz.next_balance has been updated in _nohz_idle_balance(),
>> we will see that updated value in nohz_balancer_kick() and if it's further
>> from current jiffies, the time_before(now, nohz.next_balance) test would
>> cause nohz_balancer_kick() to bail out without updating flags and that in
>> turn would avoid the kick_ilb() path.
>
> yes
>
>> Since jiffies and nohz.next_balance were appearing close or same in
>> nohz_balancer_kick() and I could see that CPU 2 was executing nohz_csd_func(),
>> almost instantly and pretty much at frequency of each tick (dtrace snippet shown
>> below), my conclusion was that one or more CPUs in sched domain of CPU 2 must
>> have had their rq->next_balance close to or same as current jiffies.
>
> Yes
>
>>
>> ts_ms = 1776868498610 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498611 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498612 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498613 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498614 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498615 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498616 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498617 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498618 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498619 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498620 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498621 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498622 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498623 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498624 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498625 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498626 rq_cpu = 2 nohz_flags = 3
>> ts_ms = 1776868498627 rq_cpu = 2 nohz_flags = 3
>>
>> Could you please let me know if this understanding is incorrect ?
>
> yes, it is correct.
>
> The ILB is kicked for several reasons:
> - NOHZ_BALANCE_KICK : periodic load balance based on the
> balance_interval of each sched_domain
> - NOHZ_STATS_KICK: update of statistics i.e. decaying the blocked load
> - NOHZ_NEXT_KICK: loop on idle cpu to update nohz.next_balance when a
> cpu becomes idle.
>
> NOHZ_NEXT_KICK and NOHZ_STATS_KICK can be set independently for
> "cheap" idle load balance
>
> and NOHZ_STATS_KICK is set whenever NOHZ_BALANCE_KICK is set to take
> advantage of the ILB to update the blocked load instead of kicking
> another one just for updating the stats.
>
>
>>
>> Regarding the question of sched_domain topology, this host
>> has 768 CPUs and almost all (except 6) have been divided
>> between 2 cpusets (one for each node). For example for node0
>> CPUs we have:
>>
>> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.partition
>> root
>> # cat /sys/fs/cgroup/sellable-numa0/cpuset.cpus.effective
>> 2-191,386-575
>>
>> and their sched_domains look as shown below:
>>
>> cpu2:
>> domain0: cpus=2,386
>> domain1: cpus=2-15,386-399
>> domain2: cpus=2-191,386-575
>> cpu3:
>> domain0: cpus=3,387
>> domain1: cpus=2-15,386-399
>> domain2: cpus=2-191,386-575
>> cpu4:
>> domain0: cpus=4,388
>> domain1: cpus=2-15,386-399
>> domain2: cpus=2-191,386-575
>> .....
>> .....
>>
>> Could you please suggest if updating rq->next_balance or
>> final nohz.next_balance by some other logic can help reduce the
>> CPU usage of _nohz_idle_balance(), or should we just ignore it
>> because the CPU is idle anyway?
>
> With SMT domain, the idle load balance will be kicked every 2 ms for
> each core domain. If the load balance of all cores is not aligned on
> the same tick, you will have an ILB every tick if there are activities
> on some CPUs and we need to check whether they can be pulled onto an idle
> CPU. But it should be light.
>
>>
>> On these systems I can see that CPU 2 is doing most of this work.
>> Running a perf top on CPU 2 gives numbers like:
>>
>> 21.69% [kernel] [k] __update_blocked_fair
>> 11.40% [kernel] [k] update_load_avg
>> 9.36% [kernel] [k] __update_load_avg_cfs_rq
>> 8.07% [kernel] [k] update_rq_clock
>> 7.09% [kernel] [k] __update_load_avg_se
>> 4.67% [kernel] [k] update_irq_load_avg
>>
>> .....
>> .....
>> 22.26% [kernel] [k] __update_blocked_fair
>> 10.89% [kernel] [k] update_load_avg
>> 9.65% [kernel] [k] __update_load_avg_cfs_rq
>> 7.80% [kernel] [k] update_rq_clock
>> 7.23% [kernel] [k] __update_load_avg_se
>> 4.76% [kernel] [k] update_sg_lb_stats
>>
>> and mpstat also shows softirq usage of around 20-25% on CPU 2 and
>> most of that is due to SCHED_SOFTIRQ leading into
>> _nohz_idle_balance.
>
> The time to update the blocked loads increases with the cgroup
> hierarchy because we must walk the hierarchy.
>
> Does it generate problems for your system? As you mentioned above, if
> CPU2 is idle, running such background activities should not cause
> harm.
>

No, it's not causing any issues. Does this mean that the second patch of this
set can be dropped as well? I could see that despite multiple CPUs being idle
in this domain, it was CPU 2 that was doing nohz idle balance most of the time.

Thanks,
Imran

>>
>> Thanks,
>> Imran
>>
>> PS: I used the following dtrace snippets to get nohz_balancer_kick
>> data shown earlier and nohz_csd_func() data shown in this message.
>>
>> dtrace -n 'fbt::nohz_balancer_kick:entry {printf("jiffies = %lu nohz.next_balance = %lu \n", `jiffies, `nohz.next_balance);}'
>>
>>
>>
>> fbt::nohz_csd_func:entry
>> {
>> this->rq = (struct rq *)arg0;
>> this->rq_cpu = this->rq->cpu;
>> this->rq_nohz_flags = this->rq->nohz_flags.counter;
>> this->ts_ms = (unsigned long)(walltimestamp / 1000000);
>> printf("ts_ms = %lu rq_cpu = %d nohz_flags = %d \n", this->ts_ms, this->rq_cpu, this->rq_nohz_flags);
>> /*printf("[%lu] IPI received on cpu=%d\n",
>> this->ts_ms, cpu);*/
>> /*@ipi_rate[cpu] = count();*/
>> }
>>
>>> Which kind of sched_domain topology do you have?
>>>
>>>
>>>> + if (flags & NOHZ_BALANCE_KICK) {
>>>> + unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
>>>> +
>>>> + /*
>>>> + * On large systems, there may always be some idle CPU(s) with
>>>> + * rq->next_balance close to or at current time, thus causing
>>>> + * frequent invocation of kick_ilb() from nohz_balancer_kick().
>>>> + * Adjust next_balance based on the number of idle CPUs.
>>>> + */
>>>> + nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
>>>> + }
>>>>
>>>> ilb_cpu = find_new_ilb();
>>>> if (ilb_cpu < 0)
>>>> --
>>>> 2.34.1
>>>>
>>