Re: [PATCH 4/4] sched/fair: Remove atomic nr_cpus and use cpumask instead

From: Shrikanth Hegde
Date: Tue Dec 02 2025 - 00:30:10 EST


Hi Ingo,

Thanks for taking a look at this.

On 12/2/25 1:28 AM, Ingo Molnar wrote:

* Shrikanth Hegde <sshegde@xxxxxxxxxxxxx> wrote:

nohz_balance_enter_idle:
cpumask_set_cpu(cpu, nohz.idle_cpus_mask)
atomic_inc(&nohz.nr_cpus)

nohz_balance_exit_idle:
cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask)
atomic_dec(&nohz.nr_cpus)

kick_ilb:
if (likely(!atomic_read(&nohz.nr_cpus)))
return;

So, idle_cpus_mask contains the same information. Instead of doing a
costly atomic read on large systems, it's better to check whether the
cpumask is empty to make the same decision about triggering idle load balance.

There might be a race between the cpumask_empty check and a remote CPU
setting its bit in the cpumask. In such a case the idle load balance will
be triggered at the next tick instead. A race with clearing the bit is not
a concern, since _nohz_idle_balance checks whether the CPU is idle before
doing the balance.

cpumask_empty uses ffs, so it should not be very costly.

Signed-off-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>

static struct {
cpumask_var_t idle_cpus_mask;
- atomic_t nr_cpus;
int has_blocked; /* Idle CPUS has blocked load */
int needs_update; /* Newly idle CPUs need their next_balance collated */
unsigned long next_balance; /* in jiffy units */
@@ -12450,7 +12449,7 @@ static void nohz_balancer_kick(struct rq *rq)
* None are in tickless mode and hence no need for NOHZ idle load
* balancing, do stats update if its due
*/
- if (unlikely(!atomic_read(&nohz.nr_cpus)))
+ if (unlikely(cpumask_empty(nohz.idle_cpus_mask)))
goto out;

So the thing is, if the goal is to avoid cacheline
bouncing, this won't fundamentally change the
situation:

rq->nohz_tick_stopped = 0;
cpumask_clear_cpu(rq->cpu, nohz.idle_cpus_mask);
- atomic_dec(&nohz.nr_cpus);

nohz.idle_cpus_mask will be on a single 64-byte
cacheline even on 512 CPU systems, and the
cpumask_clear_cpu() and cpumask_set_cpu() calls will
dirty the cacheline and make it bounce with exactly the
same frequency as the atomic_inc/dec() of nohz.nr_cpus
does today.

From the 0/4 boilerplate description:

> It was noted when running on large systems
> nohz.nr_cpus cacheline was bouncing quite often.
> There is atomic inc/dec and read happening on many
> CPUs at a time and it is possible for this line to
> bounce often.

That the nr_cpus modification is an atomic op doesn't
change the situation much in terms of cacheline
bouncing, because the cacheline dirtying will still
cause comparable levels of bouncing on modern CPUs with
modern cache coherency protocols.

If idle_cpus_mask and nohz.nr_cpus are in separate
cachelines, then this patch might eliminate about half
of the bounces - but AFAICS they are right next to each
other, so unless it's off-stack cpumasks, they should
be in the same cacheline. Half of 'bad bouncing' is
still kinda 'bad bouncing'. :-)


You are right. If we have to get rid of cacheline bouncing then
we need to fix nohz.idle_cpus_mask too.

I forgot about CPUMASK_OFFSTACK.

If CPUMASK_OFFSTACK=y, then both the idle_cpus_mask pointer and nr_cpus
are in the same cacheline, right? The data in the cover letter is with =y.
In that case, moving to cpumask_empty gives only minimal gains, by removing
the additional atomic inc/dec operations.

If CPUMASK_OFFSTACK=n, then they could be in different cachelines.
In that case the gains should be better. Very likely our performance team
would have tested with =n.
IIRC, on powerpc it is selected based on NR_CPUS; on x86 it determines
the NR_CPUS default:

arm64/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS > 256
powerpc/Kconfig: select CPUMASK_OFFSTACK if NR_CPUS >= 8192
x86/Kconfig: select CPUMASK_OFFSTACK
x86/Kconfig: default 8192 if SMP && CPUMASK_OFFSTACK
x86/Kconfig: default 512 if SMP && !CPUMASK_OFFSTACK



In either case, the invariant we rely on is:
nohz.nr_cpus == cpumask_weight(nohz.idle_cpus_mask)

Since this is not a correctness issue, at worst we lose one chance to do
idle load balance, and the next tick will do it. Looking at the code, that
can happen even today: first we set/clear the mask and then we do the
inc/dec. So if the mask was set but the inc hadn't happened yet when the
read completed, we would lose a chance (though the window is very slim).


I'm not really objecting to the patch, because it would
reduce cacheline bouncing in the offstack-mask case,
but the explanation isn't very clear about these
details.


Let me re-write the changelog. I will also look a bit more into it.


Thanks,

Ingo