Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: Chen, Yu C

Date: Wed Apr 01 2026 - 23:15:47 EST


Hello Prateek,

On 3/31/2026 6:19 PM, K Prateek Nayak wrote:
Hello Chenyu,

On 3/31/2026 11:07 AM, Chen, Yu C wrote:
update of the test:
With the above change, I did a simple hackbench test on
a system with multiple LLCs within 1 node. The benefit
is significant (+12%~+30%) when the system is under-loaded, while
there is some regression when overloaded (-10%) (need to figure out why).

Could it be because of how we are traversing the CPUs now for idle load
balancing? Since we use the first set bit for ilb_cpu and also start
balancing from that very CPU, we might just stop after a successful
balance on the ilb_cpu.

Would something like below on top of Peter's suggestion + your fix help?

(lightly tested; has survived sched messaging on bare metal)

diff --git a/include/linux/sbm.h b/include/linux/sbm.h
index 8beade6c0585..98c4c1866534 100644
--- a/include/linux/sbm.h
+++ b/include/linux/sbm.h
@@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu)
return __sbm_op(sbm, test_bit);
}
+static __always_inline
+int sbm_find_next_bit_wrap(struct sbm *sbm, int start)
+{
+ int bit = sbm_find_next_bit(sbm, start);
+
+ if (bit >= 0 || start == 0)
+ return bit;
+
+ bit = sbm_find_next_bit(sbm, 0);
+ return bit < start ? bit : -1;
+}
+
+static __always_inline
+int __sbm_for_each_wrap(struct sbm *sbm, int start, int n)
+{
+ int bit;
+
+ /* If not wrapped around */
+ if (n > start) {
+ /* and have a bit, just return it. */
+ bit = sbm_find_next_bit(sbm, n);
+ if (bit >= 0)
+ return bit;
+
+ /* Otherwise, wrap around and ... */
+ n = 0;
+ }
+
+ /* Search the other part. */
+ bit = sbm_find_next_bit(sbm, n);
+ return bit < start ? bit : -1;
+}
+
#define sbm_for_each_set_bit(sbm, idx) \
for (int idx = sbm_find_next_bit(sbm, 0); \
idx >= 0; idx = sbm_find_next_bit(sbm, idx+1))
+#define sbm_for_each_set_bit_wrap(sbm, idx, start) \
+ for (int idx = sbm_find_next_bit_wrap(sbm, start); \
+ idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1))
+
#endif /* _LINUX_SBM_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3a423c4706e..f485afb6286d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
int this_cpu = this_rq->cpu;
int balance_cpu;
struct rq *rq;
+ u32 start;
WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
@@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
* Start with the next CPU after this_cpu so we will end with this_cpu and let a
* chance for other idle cpu to pull load.
*/
- sbm_for_each_set_bit(nohz.sbm, idx) {
+ start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids);
+ sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) {
balance_cpu = arch_sbm_idx_to_cpu(idx);
if (!idle_cpu(balance_cpu))
---

This is pretty much giving me performance similar to tip for sched
messaging runs under heavy load, but your mileage may vary :-)


Thanks very much for providing this optimization. It should help
more nohz idle CPUs, beyond just the currently selected ilb_cpu,
assist in offloading work. However, when I applied this patch and
reran the test, it appeared to introduce some regressions (both
under-loaded and overloaded) compared to the baseline without
Peter's sbm applied.

One suspicion is that with sbm enabled (without your patch), more
tasks are "aggregated" onto the first CPU (or maybe the front part)
of nohz.sbm, because sbm_for_each_set_bit() always picks the first
idle CPU to pull work. As we already know, hackbench on our
platform strongly prefers being aggregated rather than being
spread across different LLCs. So with the spreading fix, the
hackbench tasks might be placed on different CPUs. Anyway, I'll run
more rounds of testing to check whether this is consistent or merely
run-to-run variance, and I'll try other workloads besides
hackbench. Or do you have a suggestion on what workload we could
try that is sensitive to nohz cpumask access? (I chose hackbench
because I found Shrikanth was using hackbench for nohz evaluation
in commit 5d86d542f6.)

thanks,
Chenyu
