Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state
From: Shubhang Kaushik
Date: Thu Feb 12 2026 - 14:36:39 EST
Hi Frederic,
On Thu, 12 Feb 2026, Frederic Weisbecker wrote:
> > Tested on Ampere Altra on 6.19.0-rc8 with CONFIG_NO_HZ_FULL enabled:
> > - This change improves load distribution by ensuring that tickless idle
> >   CPUs are visible to NOHZ idle load balancing. In llama-batched-bench,
> >   throughput improves by up to ~14% across multiple thread counts.
> > - Hackbench single-process results improve by 5% and multi-process
> >   results improve by up to ~26%, consistent with reduced scheduler
> >   jitter and earlier utilization of fully idle cores.
> > No regressions observed.
> Because you rely on dynamic placement of isolated tasks throughout the
> isolated CPUs by the scheduler.
>
> But nohz_full is designed for running only one task per isolated CPU without
> any disturbance. And migration is a significant disturbance. This is why
> nohz_full tries not to be too smart and assumes that task placement is
> entirely in the hands of the user.
>
> So I have to ask: what prevents you from using static task placement in your
> workload?
Actually, the llama-batched-bench results I shared already included static
affinity testing via numactl -C. Even with static placement, we observe the
~14% throughput improvement. This suggests the issue isn't the scheduler
trying to be smart with task migration, but rather the side effects of an
idle CPU being absent from nohz.idle_cpus_mask.

When nohz_full CPUs enter idle but aren't correctly accounted for in the idle
mask, this appears to cause unnecessary overhead or interference in the NOHZ
load balancing logic for the CPUs that are still running tasks. By ensuring
the idle state is correctly tracked, we're not encouraging migration; we're
ensuring the scheduler's global state accurately reflects reality.

AFAICT this is a case where correcting the bookkeeping benefits HPC
throughput even when the user handles all task placement manually.
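For reference, the static placement discussed above is typically done with
numactl or taskset; the CPU list, thread count, and binary path below are
placeholders for illustration, not the actual benchmark configuration:

```shell
# Pin the benchmark and its threads to CPUs 8-15 (hypothetical CPU list
# and invocation); -C / --physcpubind restricts the task to those CPUs:
numactl -C 8-15 ./llama-batched-bench -t 8

# Equivalent pinning with taskset:
taskset -c 8-15 ./llama-batched-bench -t 8
```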
Regards,
Shubhang Kaushik
> I'm not saying it's undesirable or impossible to do adaptive userspace
> dynticks for users who don't rely on ultra-low latency but rather on high
> CPU-bound performance. In fact, the initial purpose of nohz_full was HPC,
> not real-time. It turns out that real-time is the only use case I have seen
> so far, and you're the first HPC one. But adapting nohz_full dynamically
> for that will involve much more than just load balancing. For now, static
> affinity should work for everyone.
>
> Thanks.
Signed-off-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Adam Li <adamli@xxxxxxxxxxxxxxxxxxxxxx>
Reviewed-by: Christoph Lameter (Ampere) <cl@xxxxxxxxxx>
Reviewed-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
---
This is a resend of the original patch to ensure visibility.
Previous resend: https://lkml.org/lkml/2025/8/21/170
Original thread: https://lkml.org/lkml/2025/8/21/171
The patch addresses a performance regression in NOHZ idle load balancing
observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
invisible to the balancer.
---
kernel/time/tick-sched.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
 		ts->idle_sleeps++;
 		ts->idle_expires = expires;
 
-		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
-			ts->idle_jiffies = ts->last_jiffies;
+		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+			if (!was_stopped)
+				ts->idle_jiffies = ts->last_jiffies;
 			nohz_balance_enter_idle(cpu);
 		}
 	} else {
---
base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
change-id: 20260203-fix-nohz-idle-b2838276cb91
Best regards,
--
Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
--
Frederic Weisbecker
SUSE Labs