Re: [PATCH 1/2] sched/fair: scale nohz.next_balance according to number of idle CPUs.

From: Shrikanth Hegde

Date: Tue Apr 21 2026 - 13:34:07 EST


Hi Imran,

On 4/21/26 10:36 AM, Imran Khan wrote:
On large scale systems, for example with 768 CPUs and cpusets consisting
of 380+ CPUs, there may always be some idle CPU with its rq->next_balance
close to or the same as now.
This causes nohz.next_balance to be perpetually the same as current jiffies,
thus causing the time-based check in nohz_balancer_kick() to always fail.
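
To illustrate why the time-based check stops helping once nohz.next_balance equals the current jiffy, here is a minimal userspace sketch. The helper names jiffies_before() and kick_bails_out() are hypothetical and only mirror the idea behind the kernel's time_before(); this is not the kernel code itself.

```c
#include <stdbool.h>

/*
 * Wrap-safe "a is earlier than b" comparison, in the spirit of time_before().
 * Relies on the usual two's-complement conversion when casting to long.
 */
static bool jiffies_before(unsigned long a, unsigned long b)
{
	return (long)(a - b) < 0;
}

/*
 * True when the time-based check would let the kick path return early.
 * Hypothetical helper: it models only the "if (time_before(now, next_balance))"
 * test, nothing else in nohz_balancer_kick().
 */
static bool kick_bails_out(unsigned long now, unsigned long next_balance)
{
	return jiffies_before(now, next_balance);
}
```

With next_balance equal to now, kick_bails_out() returns false, so the early exit is never taken; it only returns true when next_balance lies strictly in the future.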

Some benchmarks will be happy with faster idle load balance and some not.
Could you share the performance numbers or benchmarks you have tried?


For example, putting a dtrace probe at nohz_balancer_kick on such a system,
we can see that nohz.next_balance is at the current jiffy at almost every tick:


This depends on the system utilization too. When the system is idle, I see
nohz.next_balance increment randomly. But at around 50% utilization, it increments by
1-2 ticks, which is similar to your observation.

What was the utilization in the case below? Or was it a combination of a specific number
of threads and their utilization?

447 9536 nohz_balancer_kick:entry jiffies=9764770863 nohz.next_balance=9764770863
447 9536 nohz_balancer_kick:entry jiffies=9764770864 nohz.next_balance=9764770864
447 9536 nohz_balancer_kick:entry jiffies=9764770865 nohz.next_balance=9764770865
447 9536 nohz_balancer_kick:entry jiffies=9764770866 nohz.next_balance=9764770866
447 9536 nohz_balancer_kick:entry jiffies=9764770867 nohz.next_balance=9764770867
447 9536 nohz_balancer_kick:entry jiffies=9764770868 nohz.next_balance=9764770868
447 9536 nohz_balancer_kick:entry jiffies=9764770869 nohz.next_balance=9764770870
447 9536 nohz_balancer_kick:entry jiffies=9764770870 nohz.next_balance=9764770870
447 9536 nohz_balancer_kick:entry jiffies=9764770871 nohz.next_balance=9764770871
447 9536 nohz_balancer_kick:entry jiffies=9764770872 nohz.next_balance=9764770872
447 9536 nohz_balancer_kick:entry jiffies=9764770873 nohz.next_balance=9764770873
447 9536 nohz_balancer_kick:entry jiffies=9764770874 nohz.next_balance=9764770874
447 9536 nohz_balancer_kick:entry jiffies=9764770875 nohz.next_balance=9764770876
447 9536 nohz_balancer_kick:entry jiffies=9764770876 nohz.next_balance=9764770876
447 9536 nohz_balancer_kick:entry jiffies=9764770877 nohz.next_balance=9764770877
447 9536 nohz_balancer_kick:entry jiffies=9764770878 nohz.next_balance=9764770878

On such a system, setting nohz.next_balance to the next jiffy can cause kick_ilb()
to run almost every tick, and this in turn can consume a lot of CPU cycles in
subsequent nohz idle balancing.
So set nohz.next_balance based on the number of currently idle CPUs, such that
for 32 idle CPUs nohz.next_balance is advanced further by 1 jiffy.
This will allow nohz_balancer_kick() to bail out early.
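
As a rough illustration of the scaling, the expression used in the patch, (nr_idle > 32) ? ilog2(nr_idle) - 4 : 0, can be evaluated in userspace. The names ilog2_u() and nohz_extra_jiffies() below are hypothetical stand-ins; ilog2_u() assumes the same floor(log2(n)) behaviour as the kernel's ilog2().

```c
#include <assert.h>

/* Userspace stand-in for ilog2(): floor(log2(n)) for n > 0. */
static unsigned int ilog2_u(unsigned int n)
{
	unsigned int log = 0;

	assert(n > 0);
	while (n >>= 1)
		log++;
	return log;
}

/*
 * Extra jiffies added on top of "jiffies + 1" for a given number of
 * idle CPUs, following the expression in the patch.
 */
static unsigned int nohz_extra_jiffies(unsigned int nr_idle)
{
	return (nr_idle > 32) ? ilog2_u(nr_idle) - 4 : 0;
}
```

Under this expression, exactly 32 idle CPUs get no extra delay, 33 to 63 get 1 extra jiffy, 64 to 127 get 2, and the offset grows by one for each further doubling; 380 idle CPUs would get 4 and 768 would get 5.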


I gave the patch series a go and observed at 25% load how the increments happen.
I have attached the tracing diff at the end.

I still see nohz.next_balance increment by 1-2 ticks under the same 25% load in some places.
Overall it is better with the patch, but the improvement is very difficult to observe.

How does nohz.next_balance increment in your case with the patch?

Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
---
kernel/sched/fair.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ab4114712be74..bd35275a05b38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12447,8 +12447,17 @@ static void kick_ilb(unsigned int flags)
* Increase nohz.next_balance only when if full ilb is triggered but
* not if we only update stats.
*/
- if (flags & NOHZ_BALANCE_KICK)
- nohz.next_balance = jiffies+1;
+ if (flags & NOHZ_BALANCE_KICK) {
+ unsigned int nr_idle = cpumask_weight(nohz.idle_cpus_mask);
+
+ /*
+ * On large systems, there may always be some idle CPU(s) with
+ * rq->next_balance close to or at current time, thus causing
+ * frequent invocation of kick_ilb() from nohz_balancer_kick().
+ * Adjust next_balance based on the number of idle CPUs.
+ */
+ nohz.next_balance = jiffies + 1 + ((nr_idle > 32) ? ilog2(nr_idle) - 4 : 0);


Also, I have seen in traces using the patch below that nohz.next_balance sometimes goes
backwards (without your patches too).
I used WRITE_ONCE for all nohz.next_balance writes and still see it.

It shouldn't be a big concern, I guess.
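
For anyone who wants to quantify the backward moves from collected trace values, a minimal userspace sketch follows. It assumes the next_balance samples have already been extracted into an array in trace order; count_backward_steps() is a hypothetical helper, not something in the kernel tree.

```c
#include <stddef.h>

/*
 * Count samples whose next_balance is earlier than the previous sample.
 * The signed difference keeps the comparison correct across counter wrap,
 * in the same way the kernel's time_before() does.
 */
static size_t count_backward_steps(const unsigned long *next_balance, size_t n)
{
	size_t i, backward = 0;

	for (i = 1; i < n; i++)
		if ((long)(next_balance[i] - next_balance[i - 1]) < 0)
			backward++;
	return backward;
}
```

Repeated values are not counted, only strict decreases, so a trace where next_balance stays put for several ticks reports zero.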


PS:
I have used the diff below to print the values.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7a298d149f29..452a981df48b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12525,6 +12525,7 @@ static void nohz_balancer_kick(struct rq *rq)
* But idle load balancing is not done as find_new_ilb fails.
* That's very rare. So read nohz.nr_cpus only if time is due.
*/
+ trace_printk("cpu: %d, jiffies: %lu, next_balance: %lu\n", cpu, now, nohz.next_balance);
if (time_before(now, nohz.next_balance))
goto out;