Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.

From: Vincent Guittot

Date: Wed Apr 22 2026 - 03:55:15 EST


On Tue, 21 Apr 2026 at 07:06, Imran Khan <imran.f.khan@xxxxxxxxxx> wrote:
>
> On large scale systems, for example with 768 CPUs and cpusets consisting
> of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
> close to or the same as now.
> This causes nohz.next_balance to perpetually equal the current jiffies,
> making the time based check in nohz_balancer_kick() always fail.
>
> For example putting dtrace probe at nohz_balancer_kick, on such a system,
> we can see that nohz.next_balance is at current jiffy at almost each tick:
>
> 447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
> 447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
> 447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
> 447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
> 447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
> 447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
> 447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
> 447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
> 447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
> 447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
> 447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
> 447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
> 447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
> 447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
> 447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
> 447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878
>
> On such a system, setting nohz.next_balance to the next jiffy can cause
> kick_ilb() to run almost every tick, and this in turn can consume a lot of
> CPU cycles in subsequent nohz idle balancing.
> So set nohz.next_balance based on the number of currently idle CPUs, such
> that with more than 32 idle CPUs, nohz.next_balance is advanced by an extra
> ilog2(nr_idle) - 4 jiffies.
> This lets nohz_balancer_kick() bail out early.
>
> Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 13 +++++++++++--
> 1 file changed, 11 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index ab4114712be74..bd35275a05b38 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
> * Increase nohz.next_balance only when if full ilb is triggered but
> * not if we only update stats.
> */
> - if (flags & NOHZ_BALANCE_KICK)
> - nohz.next_balance = jiffies+1;

This +1 is only a cheap way to prevent multiple nohz idle balances from
being triggered during the same jiffy.

The actual update of nohz.next_balance is done in _nohz_idle_balance()
and reflects the next balance of all idle rqs. You should look at the
balance intervals of your sched_domains. The minimum interval is the
weight of the sched_domain, which can be 2 at SMT level.

Which kind of sched_domain topology do you have?


> + if (flags & NOHZ_BALANCE_KICK) {
> + unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
> +
> + /*
> + * On large systems, there may always be some idle CPU(s) with
> + * rq->next_balance close to or at current time, thus causing
> + * frequent invocation of kick_ilb() from nohz_balancer_kick().
> + * Adjust next_balance based on the number of idle CPUs.
> + */
> + nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);
> + }
>
> ilb_cpu = find_new_ilb();
> if (ilb_cpu < 0)
> --
> 2.34.1
>