Re: [PATCH] sched/fair: Avoid false sharing in nohz struct

From: Guo, Wangyang

Date: Sun Dec 21 2025 - 21:21:49 EST


On 12/21/2025 9:05 PM, Shrikanth Hegde wrote:
Hi Wangyang,

On 12/11/25 11:26 AM, Wangyang Guo wrote:
There are two potential false sharing issue in nohz struct:
1. idle_cpus_mask is a read-mostly field, but share the same cacheline
    with frequently updated nr_cpus.

Updates to idle_cpus_mask is not same cacheline. it is updated alongside nr_cpus.

with CPUMASK_OFFSTACK=y, idle_cpus_mask is a pointer to the actual mask.
Updates to it happen in another cacheline.

with CPUMASK_OFFSTACK=n, idle_cpus_mask is on the stack and its length
depends on NR_CPUS. typical value being 512/2048/8192 it can span a few
cachelines. So updates to it likely in different cacheline compared to nr_cpus.

see  https://lore.kernel.org/all/aS6bK4ad-wO2fsoo@xxxxxxxxx/

This patch is mainly target for idle_cpus_mask as a pointer, which is default for many distro OS.


Likely in your case, nr_cpus updates are the costly ones.
Try below and see if it helps to fix your issue too.
https://lore.kernel.org/all/20251201183146.74443-1-sshegde@xxxxxxxxxxxxx/
I Should send out new version soon.

2. Data followed by nohz still share the same cacheline and has
    potential false sharing issue.


How does your patch handle this?
I don't see any other struct apart from nohz being changed.

The data follow by nohz is implicit and determined by compiler.
For example, this is the layout from /proc/kallsyms in my machine:
ffffffff88600d40 b nohz
ffffffff88600d68 B arch_needs_tick_broadcast
ffffffff88600d6c b __key.264
ffffffff88600d6c b __key.265
ffffffff88600d70 b dl_generation
ffffffff88600d78 b sched_clock_irqtime

What we can do is placing read-mostly `idle_cpus_mask` pointer in a new cacheline, so data followed by nohz would not be affected by nr_cpus.


This patch tries to resolve the above two problems by isolating the
frequently updated fields in a single cacheline.

Reported-by: Benjamin Lei <benjamin.lei@xxxxxxxxx>
Reviewed-by: Tim Chen <tim.c.chen@xxxxxxxxxxxxxxx>
Reviewed-by: Tianyou Li <tianyou.li@xxxxxxxxx>
Signed-off-by: Wangyang Guo <wangyang.guo@xxxxxxxxx>
---
  kernel/sched/fair.c | 7 ++++---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5b752324270b..bcc2766b7986 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7193,13 +7193,14 @@ static DEFINE_PER_CPU(cpumask_var_t, should_we_balance_tmpmask);
  #ifdef CONFIG_NO_HZ_COMMON
  static struct {
-    cpumask_var_t idle_cpus_mask;
-    atomic_t nr_cpus;
+    /* Isolate frequently updated fields in a cacheline to avoid false sharing issue. */
+    atomic_t nr_cpus ____cacheline_aligned;
      int has_blocked;        /* Idle CPUS has blocked load */
      int needs_update;        /* Newly idle CPUs need their next_balance collated */
      unsigned long next_balance;     /* in jiffy units */
      unsigned long next_blocked;    /* Next update of blocked load in jiffies */
-} nohz ____cacheline_aligned;
+    cpumask_var_t idle_cpus_mask ____cacheline_aligned;
+} nohz;


This can cause a lot of space wastage.
for exp: powerpc has 128 byte cacheline.


nohz is global, only one exists. The size inflating is minimal, less than 1 cacheline.

BR
Wangyang