Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention
From: K Prateek Nayak
Date: Thu Apr 02 2026 - 00:41:36 EST
Hello Chenyu,
Thank you for testing the changes! Much appreciated.
On 4/2/2026 8:45 AM, Chen, Yu C wrote:
> One suspicion is that with sbm enabled(without your patch), more
> tasks are "aggregated" onto the first CPU(or maybe the front part)
> in nohz.sbm, because sbm_for_each_set_bit() always picks the first
> idle CPU to pull work. As we already know, hackbench on our
> platform strongly prefers being aggregated rather than being
> spread across different LLCs. So with the spreading fix, the
> hackbench might be put to different CPUs.
Ack! But I cannot come up with a theory for why it would be any
worse than the original.
P.S. what does your SBM log in the dmesg look like? On my 3rd Generation
EPYC machine (2 x 64C/128T) it looks like:
CPU topo: SBM: shift(6) leafs(4) APIC(ff)
Now, I suppose I get 4 leaves because I have 128 CPUs per socket
(2 x u64 per socket) but it is not super clear to me how that is
achieved from doing:
arch_sbm_shift = x86_topo_system.dom_shifts[TOPO_DIE_DOMAIN] - 1;
which divides the TOPO_DIE_DOMAIN into two, but that should only be
okay up to 128 CPUs per DIE.
It is still not clear to me how the logic deals with more than
128 CPUs in a DIE domain, since that would need more than a u64, but
sbm_find_next_bit() simply does:
tmp = leaf->bitmap & mask; /* All are u64 */
expecting just the u64 bitmap to represent all the CPUs in the leaf.
If we have, say, 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
as 7f (127), which allows a leaf to span more than 64 CPUs, but we
are using the "u64 bitmap" directly and not:
find_next_bit(bitmap, arch_sbm_mask)
Am I missing something here?
AMD got the 0x80000026 CPUID leaf for defining TOPO_DIE_DOMAIN as
soon as we crossed 256 CPUs per socket in 4th Generation EPYC, so
it'll have per-CCD (up to 2 LLCs) sbm leaves, but if I'm not
mistaken, some of the SPR systems still advertised one large
TILE / DIE domain.
I'm curious if your test system exposed multiple DIEs per PKG, since
280 logical CPUs per socket (based on the cover letter) would still
go beyond needing 64 bits if it is advertised as a single DIE.
> Anyway, I'll run more
> rounds of testing to check whether this is consistent or merely
> due to run-to-run variance. And I'll try other workloads besides
> hackbench. Or do you have suggestion on what workload we can try,
> which is sensitive to nohz cpumask access(I chose hackbench because
> I found Shrikanth was using hackbench for nohz evaluation in
> commit 5d86d542f6)
Most sensitive is schbench's tail latency when the system is fully
loaded (#workers = #CPUs), but that data point also has large
run-to-run variation - I generally look for crazy jumps, like the
tail latency turning 5-8x consistently across multiple runs, before
actually concluding it is a regression.
hackbench (/ sched-messaging) should be good enough from a
throughput standpoint.
--
Thanks and Regards,
Prateek