Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state

From: Shubhang Kaushik

Date: Thu Feb 12 2026 - 15:04:28 EST


On Thu, 12 Feb 2026, Shubhang Kaushik wrote:

> Because you rely on dynamic placement of isolated tasks throughout
> isolated CPUs by the scheduler.
>
> But nohz_full is designed for running only one task per isolated CPU
> without any disturbance. And migration is a significant disturbance.
> This is why nohz_full tries not to be too smart and assumes that task
> placement is entirely in the hands of the user.
>
> So I have to ask, what prevents you from using static task placement
> in your workload?

Actually, the llama-batched-bench results I shared already included static affinity testing via numactl -C.

That is, even when tasks are strictly pinned to individual cores, the performance gap remains.

IIUC, the current implementation assumes tick-stop and idle-entry are coupled. While this holds for standard NOHZ, nohz_full decouples them, causing idle CPUs to be omitted from nohz.idle_cpus_mask.

This hides idle capacity from the NOHZ idle balancer, forcing housekeeping tasks onto active cores. By decoupling these transitions in the code, we ensure accurate state accounting.


Even with static placement, we observe this ~14% throughput improvement. This suggests that the issue isn't about the scheduler trying to be smart with task migration, but rather about the side effects of an idle CPU being absent from nohz.idle_cpus_mask.

When nohz_full CPUs enter idle but aren't correctly accounted for in the idle mask, it appears to cause unnecessary overhead or interference in the NOHZ load balancing logic for the CPUs that are still running tasks. By ensuring the idle state is correctly tracked, we're not encouraging migration, but rather ensuring the scheduler's global state accurately reflects reality.

AFAICT this seems to be a case where correcting the bookkeeping benefits HPC throughput even when the user handles all task placement manually.

Regards,
Shubhang Kaushik

I'm not saying it's undesirable or impossible to do adaptive userspace
dynticks for users that don't rely on ultra-low latency but rather on
high CPU-bound performance. In fact, the initial purpose of nohz_full
was HPC, not real-time. It turns out that real-time covers all the use
cases I have seen so far, and you're the first HPC one. But adapting
nohz_full dynamically for that will involve much more than just load
balancing. For now, static affinity should work for everyone.

Thanks.



Signed-off-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
Signed-off-by: Adam Li <adamli@xxxxxxxxxxxxxxxxxxxxxx>
Reviewed-by: Christoph Lameter (Ampere) <cl@xxxxxxxxxx>
Reviewed-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
---
This is a resend of the original patch to ensure visibility.
Previous resend: https://lkml.org/lkml/2025/8/21/170
Original thread: https://lkml.org/lkml/2025/8/21/171

The patch addresses a performance regression in NOHZ idle load balancing
observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
invisible to the balancer.
---
kernel/time/tick-sched.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
 		ts->idle_sleeps++;
 		ts->idle_expires = expires;
 
-		if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
-			ts->idle_jiffies = ts->last_jiffies;
+		if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
+			if (!was_stopped)
+				ts->idle_jiffies = ts->last_jiffies;
 			nohz_balance_enter_idle(cpu);
 		}
 	} else {

---
base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
change-id: 20260203-fix-nohz-idle-b2838276cb91

Best regards,
--
Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>


--
Frederic Weisbecker
SUSE Labs