Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: K Prateek Nayak

Date: Tue Mar 31 2026 - 06:28:25 EST


Hello Chenyu,

On 3/31/2026 11:07 AM, Chen, Yu C wrote:
> update of the test:
> With the above change, I did a simple hackbench test on
> a system with multiple LLCs within 1 node. The benefit
> is significant (+12%~+30%) when the system is under-loaded, while
> there is some regression when overloaded (-10%) (need to figure out).

Could it be because of how we are traversing the CPUs now for idle load
balancing? Since we use the first set bit for ilb_cpu and also start
balancing from that very CPU, we might just stop after a successful
balance on the ilb_cpu.

Would something like below on top of Peter's suggestion + your fix help?

(lightly tested; has survived sched messaging on bare metal)

diff --git a/include/linux/sbm.h b/include/linux/sbm.h
index 8beade6c0585..98c4c1866534 100644
--- a/include/linux/sbm.h
+++ b/include/linux/sbm.h
@@ -76,8 +76,45 @@ static inline bool sbm_cpu_test(struct sbm *sbm, int cpu)
 	return __sbm_op(sbm, test_bit);
 }
 
+static __always_inline
+int sbm_find_next_bit_wrap(struct sbm *sbm, int start)
+{
+	int bit = sbm_find_next_bit(sbm, start);
+
+	if (bit >= 0 || start == 0)
+		return bit;
+
+	bit = sbm_find_next_bit(sbm, 0);
+	return bit < start ? bit : -1;
+}
+
+static __always_inline
+int __sbm_for_each_wrap(struct sbm *sbm, int start, int n)
+{
+	int bit;
+
+	/* If not wrapped around */
+	if (n > start) {
+		/* and have a bit, just return it. */
+		bit = sbm_find_next_bit(sbm, n);
+		if (bit >= 0)
+			return bit;
+
+		/* Otherwise, wrap around and ... */
+		n = 0;
+	}
+
+	/* Search the other part. */
+	bit = sbm_find_next_bit(sbm, n);
+	return bit < start ? bit : -1;
+}
+
 #define sbm_for_each_set_bit(sbm, idx)					\
 	for (int idx = sbm_find_next_bit(sbm, 0);			\
 	     idx >= 0; idx = sbm_find_next_bit(sbm, idx+1))
 
+#define sbm_for_each_set_bit_wrap(sbm, idx, start)			\
+	for (int idx = sbm_find_next_bit_wrap(sbm, start);		\
+	     idx >= 0; idx = __sbm_for_each_wrap(sbm, start, idx+1))
+
 #endif /* _LINUX_SBM_H */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a3a423c4706e..f485afb6286d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -12916,6 +12916,7 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	int this_cpu = this_rq->cpu;
 	int balance_cpu;
 	struct rq *rq;
+	u32 start;
 
 	WARN_ON_ONCE((flags & NOHZ_KICK_MASK) == NOHZ_BALANCE_KICK);
 
@@ -12944,7 +12945,8 @@ static void _nohz_idle_balance(struct rq *this_rq, unsigned int flags)
 	 * Start with the next CPU after this_cpu so we will end with this_cpu and let a
 	 * chance for other idle cpu to pull load.
 	 */
-	sbm_for_each_set_bit(nohz.sbm, idx) {
+	start = arch_sbm_cpu_to_idx((this_cpu + 1) % nr_cpu_ids);
+	sbm_for_each_set_bit_wrap(nohz.sbm, idx, start) {
 		balance_cpu = arch_sbm_idx_to_cpu(idx);
 
 		if (!idle_cpu(balance_cpu))
---

This gives me performance similar to tip for sched messaging runs under
heavy load, but your mileage may vary :-)

--
Thanks and Regards,
Prateek