Re: [PATCH v4] sched/fair: Skip sched_balance_running cmpxchg when balance is not due

From: Shrikanth Hegde

Date: Tue Nov 11 2025 - 01:26:00 EST


Hi Tim,

On 11/11/25 12:17 AM, Tim Chen wrote:
> The NUMA sched domain sets the SD_SERIALIZE flag by default, allowing
> only one NUMA load balancing operation to run system-wide at a time.
>
> Currently, each sched group leader directly under NUMA domain attempts
> to acquire the global sched_balance_running flag via cmpxchg() before
> checking whether load balancing is due or whether it is the designated
> load balancer for that NUMA domain. On systems with a large number
> of cores, this causes significant cache contention on the shared
> sched_balance_running flag.
>
> This patch reduces unnecessary cmpxchg() operations by first checking
> that the balancer is the designated leader for a NUMA domain from
> should_we_balance(), and the balance interval has expired before
> trying to acquire sched_balance_running to load balance a NUMA
> domain.
>
> On a 2-socket Granite Rapids system with sub-NUMA clustering enabled,
> running an OLTP workload, 7.8% of total CPU cycles were previously spent
> in sched_balance_domain() contending on sched_balance_running before
> this change.

Looks good to me. Thanks for getting this into current shape.

I see hackbench improving slightly across its variations. So,
Tested-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>
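
For anyone following along, the reordering described in the changelog can be
sketched in userspace C roughly as below. This is only an illustration and not
the kernel code: struct fake_domain, balance_old(), balance_new(),
try_acquire() and the cmpxchg_attempts counter are all hypothetical names, the
"designated" field merely stands in for the result of should_we_balance(), and
C11 atomics stand in for the kernel's cmpxchg().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the global serialization flag (assumption). */
static atomic_int balance_running;

/* Hypothetical counter: how many times the shared flag was contended. */
static unsigned long cmpxchg_attempts;

/* Hypothetical, heavily simplified view of a sched domain. */
struct fake_domain {
	bool serialize;             /* stands in for SD_SERIALIZE */
	bool designated;            /* stands in for should_we_balance() */
	unsigned long last_balance; /* time of the last balance */
	unsigned long interval;     /* balance interval */
};

static bool try_acquire(void)
{
	int expected = 0;

	cmpxchg_attempts++;
	return atomic_compare_exchange_strong(&balance_running, &expected, 1);
}

static void release(void)
{
	atomic_store(&balance_running, 0);
}

static bool balance_due(const struct fake_domain *sd, unsigned long now)
{
	return now >= sd->last_balance + sd->interval;
}

/* Old ordering: touch the shared flag first, then decide whether to balance. */
static bool balance_old(struct fake_domain *sd, unsigned long now)
{
	bool ran = false;

	if (sd->serialize && !try_acquire())
		return false;

	if (sd->designated && balance_due(sd, now)) {
		sd->last_balance = now;
		ran = true;
	}

	if (sd->serialize)
		release();
	return ran;
}

/* New ordering: do the cheap local checks first, then touch the shared flag. */
static bool balance_new(struct fake_domain *sd, unsigned long now)
{
	if (!sd->designated || !balance_due(sd, now))
		return false;

	if (sd->serialize && !try_acquire())
		return false;

	sd->last_balance = now;

	if (sd->serialize)
		release();
	return true;
}
```

In this sketch, a CPU that is not the designated balancer, or whose interval
has not expired, returns from balance_new() without ever writing to the shared
flag, whereas balance_old() performs one compare-and-exchange per call
regardless.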