Re: [PATCH v3 3/3] sched/fair: Remove nohz.nr_cpus and use weight of cpumask instead

From: Shrikanth Hegde
Date: Fri Jan 09 2026 - 10:19:34 EST


Hi Valentin, thanks for going through this.

On 1/9/26 8:14 PM, Valentin Schneider wrote:
> On 07/01/26 12:21, Shrikanth Hegde wrote:
>> nohz.nr_cpus was observed as contended cacheline when running
>> enterprise workload on large systems.
>>
>> Fundamental scalability challenge with nohz.idle_cpus_mask
>> and nohz.nr_cpus is the following:
>>
>> (1) nohz_balancer_kick() observes (reads) nohz.nr_cpus
>> (or nohz.idle_cpus_mask) and nohz.has_blocked to see whether there's
>> any nohz balancing work to do, in every scheduler tick.
>>
>> (2) nohz_balance_enter_idle() and nohz_balance_exit_idle()
>> (through nohz_balancer_kick() via sched_tick()) modify (write)
>> nohz.nr_cpus (and/or nohz.idle_cpus_mask) and nohz.has_blocked.
>>
>
> My first reaction on reading the whole changelog was: "but .nr_cpus and
> .idle_cpus_mask are in the same cacheline?!", which as Ingo pointed out
> somewhere down [1] isn't true for CPUMASK_OFFSTACK, so this change
> effectively gets rid of the dirtying of one extra cacheline during idle
> entry/exit.
>
> [1]: http://lore.kernel.org/r/aS3za7X9BLS5rg65@xxxxxxxxx
>
> I'd suggest adding something like so in this part of the changelog:
>
> """
> Note that nohz.idle_cpus_mask and nohz.nr_cpus reside in the same
> cacheline, however under CONFIG_CPUMASK_OFFSTACK the backing storage for
> nohz.idle_cpus_mask will be elsewhere. This implies two separate cachelines
> being dirtied upon idle entry / exit.
> """


ok. Will do that. Thanks.

Even with CONFIG_CPUMASK_OFFSTACK=n, NR_CPUS is usually configured as
512, 1024, 2048 or higher.

With a 64-byte cacheline, one cacheline holds the mask bits for 512 CPUs.
So idle_cpus_mask and the rest of the nohz fields, including nr_cpus, will
be in different cachelines.

Even on powerpc (128-byte cacheline), where CONFIG_CPUMASK_OFFSTACK=n,
the default is NR_CPUS=2048. That means idle_cpus_mask takes 2 cachelines
and the rest of the nohz fields land in a third cacheline.

So in most cases, this change means dirtying one less cacheline.
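
To make the arithmetic above concrete, here is a rough userspace sketch.
The helper names cpumask_bytes() and cpumask_cachelines() are made up for
illustration and are not kernel APIs. It assumes the mask is stored inline
(CONFIG_CPUMASK_OFFSTACK=n), is the first field of the struct, and starts
on a cacheline boundary.

```c
#include <stddef.h>

#define BITS_PER_BYTE 8

/*
 * Hypothetical helper: bytes occupied by an inline cpumask of nr_cpus bits,
 * rounded up to a whole number of unsigned longs.
 */
static size_t cpumask_bytes(size_t nr_cpus)
{
	size_t bits_per_long = sizeof(unsigned long) * BITS_PER_BYTE;
	size_t longs = (nr_cpus + bits_per_long - 1) / bits_per_long;

	return longs * sizeof(unsigned long);
}

/*
 * Hypothetical helper: number of cachelines the mask spans, assuming it
 * starts on a cacheline boundary. Fields placed after the mask begin in
 * the next cacheline whenever the mask fills its last line completely.
 */
static size_t cpumask_cachelines(size_t nr_cpus, size_t cacheline_bytes)
{
	return (cpumask_bytes(nr_cpus) + cacheline_bytes - 1) / cacheline_bytes;
}
```

Under those assumptions, cpumask_cachelines(512, 64) gives 1 and
cpumask_cachelines(2048, 128) gives 2, so in both cases nr_cpus would sit
in the cacheline following the mask.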

Data points with CONFIG_CPUMASK_OFFSTACK=y/n:
[1]: https://lore.kernel.org/all/fdb378e7-7797-4aeb-a79f-12af4cb1b81a@xxxxxxxxxxxxx/