Re: [RESEND PATCH] tick/nohz: Fix wrong NOHZ idle CPU state

From: Frederic Weisbecker

Date: Thu Feb 12 2026 - 09:35:08 EST


On Tue, Feb 03, 2026 at 04:49:03PM -0800, Shubhang Kaushik wrote:
> Under CONFIG_NO_HZ_FULL, the scheduler tick can get stopped earlier via
> tick_nohz_full_stop_tick() before the CPU subsequently enters the idle
> path. In this case, tick_nohz_idle_stop_tick() observes TS_FLAG_STOPPED
> already set and skips nohz_balance_enter_idle() because the !was_stopped
> condition assumes tick-stop and idle-entry are coupled.
> This leaves a tickless idle CPU absent from nohz.idle_cpus_mask, making
> it invisible to NOHZ idle load balancing while periodic balancing is
> also suppressed.
>
> The patch fixes this by decoupling tick-stop transition accounting from
> scheduler bookkeeping. idle_jiffies remains updated only on the
> tick-stop transition, while nohz_balance_enter_idle() is invoked
> whenever a CPU enters idle with the tick already stopped, relying on its
> existing idempotent guard to avoid duplicate registration.
>
> Tested on Ampere Altra on 6.19.0-rc8 with CONFIG_NO_HZ_FULL enabled:
> - This change improves load distribution by ensuring that tickless idle
> CPUs are visible to NOHZ idle load balancing. In llama-batched-bench,
> throughput improves by up to ~14% across multiple thread counts.
> - Hackbench single-process results improve by 5% and multi-process
> results improve by up to ~26%, consistent with reduced scheduler
> jitter and earlier utilization of fully idle cores.
> No regressions observed.

That is because you rely on the scheduler to dynamically place tasks across
the isolated CPUs.

But nohz_full is designed for running only one task per isolated CPU without
any disturbance. And migration is a significant disturbance. This is why
nohz_full tries not to be too smart and assumes that task placement is entirely
in the hands of the user.

So I have to ask, what prevents you from using static task placement in your
workload?
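
For reference, static placement needs no scheduler involvement at all; a
minimal sketch (assuming CPU 3 is listed in the nohz_full=/isolcpus= boot
parameters, and with "sleep 60" standing in for the real workload binary):

```shell
# Pin the workload to isolated CPU 3 at launch (util-linux taskset);
# "sleep 60" is a placeholder for the actual workload binary.
taskset -c 3 sleep 60 &

# Verify the affinity mask of the running task, then clean up.
taskset -cp $!
kill $!
```

The same effect can be had with sched_setaffinity() from inside the
workload, or with a cpuset partition; either way each isolated CPU keeps
exactly one task and no load balancing is needed.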

I'm not saying it's undesirable or impossible to do adaptive userspace dyntick
for users that don't rely on ultra-low latency but rather on high CPU-bound
performance. In fact, the initial purpose of nohz_full was HPC, not real-time.
It turns out that real-time is the only use case I have seen so far, and you're
the first HPC one. But adapting nohz_full dynamically for that will involve
much more than just load balancing. For now, static affinity should work for
everyone.

Thanks.


>
> Signed-off-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Adam Li <adamli@xxxxxxxxxxxxxxxxxxxxxx>
> Reviewed-by: Christoph Lameter (Ampere) <cl@xxxxxxxxxx>
> Reviewed-by: Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
> ---
> This is a resend of the original patch to ensure visibility.
> Previous resend: https://lkml.org/lkml/2025/8/21/170
> Original thread: https://lkml.org/lkml/2025/8/21/171
>
> The patch addresses a performance regression in NOHZ idle load balancing
> observed under CONFIG_NO_HZ_FULL, where idle CPUs were becoming
> invisible to the balancer.
> ---
> kernel/time/tick-sched.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 2f8a7923fa279409ffe950f770ff2eac868f6ece..eee6fcebe78c2f8d93464a55fe332e12fe9c164e 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -1250,8 +1250,9 @@ void tick_nohz_idle_stop_tick(void)
> ts->idle_sleeps++;
> ts->idle_expires = expires;
>
> - if (!was_stopped && tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> - ts->idle_jiffies = ts->last_jiffies;
> + if (tick_sched_flag_test(ts, TS_FLAG_STOPPED)) {
> + if (!was_stopped)
> + ts->idle_jiffies = ts->last_jiffies;
> nohz_balance_enter_idle(cpu);
> }
> } else {
>
> ---
> base-commit: 18f7fcd5e69a04df57b563360b88be72471d6b62
> change-id: 20260203-fix-nohz-idle-b2838276cb91
>
> Best regards,
> --
> Shubhang Kaushik <shubhang@xxxxxxxxxxxxxxxxxxxxxx>
>

--
Frederic Weisbecker
SUSE Labs