Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: K Prateek Nayak

Date: Thu Apr 02 2026 - 07:09:42 EST


Hello Peter,

On 4/2/2026 4:25 PM, Peter Zijlstra wrote:
> On Thu, Apr 02, 2026 at 10:11:11AM +0530, K Prateek Nayak wrote:
>
>> It is still not super clear to me how the logic deals with more than
>> 128CPUs in a DIE domain because that'll need more than the u64 but
>> sbm_find_next_bit() simply does:
>>
>> tmp = leaf->bitmap & mask; /* All are u64 */
>>
>> expecting just the u64 bitmap to represent all the CPUs in the leaf.
>>
>> If we have, say 256 CPUs per DIE, we get shift(7) and arch_sbm_mask
>> as 7f (127) which allows a leaf to more than 64 CPUs but we are
>> using the "u64 bitmap" directly and not:
>>
>> find_next_bit(bitmap, arch_sbm_mask)
>>
>> Am I missing something here?
>
> Nope. That logic just isn't there, that was left as an exercise to the
> reader :-)

Ack! Let me go fiddle with that.
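
For illustration only, the missing multi-word lookup could look something like the untested userspace sketch below. The names (struct sbm_leaf_sketch, sbm_leaf_set_bit(), sbm_leaf_find_next_bit()) and the 256-bit leaf size are hypothetical and not taken from the patch; in the kernel this would presumably just be find_next_bit() over an unsigned long array. It relies on the GCC/Clang builtin __builtin_ctzll().

```c
#include <assert.h>
#include <stdint.h>

#define SBM_LEAF_BITS	256u			/* hypothetical: CPUs per leaf */
#define SBM_LEAF_WORDS	(SBM_LEAF_BITS / 64u)

/* Hypothetical leaf that spans several u64 words instead of just one. */
struct sbm_leaf_sketch {
	uint64_t bitmap[SBM_LEAF_WORDS];
};

/* Non-atomic set, only here to populate the leaf in this sketch. */
static void sbm_leaf_set_bit(struct sbm_leaf_sketch *leaf, unsigned int bit)
{
	assert(bit < SBM_LEAF_BITS);
	leaf->bitmap[bit / 64u] |= UINT64_C(1) << (bit % 64u);
}

/* Return the first set bit at or after @start, or SBM_LEAF_BITS if none. */
static unsigned int sbm_leaf_find_next_bit(const struct sbm_leaf_sketch *leaf,
					   unsigned int start)
{
	unsigned int word;
	uint64_t tmp;

	if (start >= SBM_LEAF_BITS)
		return SBM_LEAF_BITS;

	word = start / 64u;
	/* Only the first word needs the bits below @start masked off. */
	tmp = leaf->bitmap[word] & (~UINT64_C(0) << (start % 64u));

	while (!tmp) {
		if (++word == SBM_LEAF_WORDS)
			return SBM_LEAF_BITS;
		tmp = leaf->bitmap[word];
	}

	return word * 64u + (unsigned int)__builtin_ctzll(tmp);
}
```

The point is only that the search has to walk every word of the leaf rather than treating a single u64 as the whole leaf.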

>
> For AMD in particular it would be good to have one leaf per CCD, but
> since CCD are not enumerated in your topology (they really should be), I
> didn't do that.

We have had the extended topology leaf (CPUID 0x80000026) since 4th
Generation EPYC, and we (well, Thomas) added parser support in v6.10 [1],
so we can discover the CCD boundary using that now ;-)

[1] https://lore.kernel.org/all/20240314050432.1710-1-kprateek.nayak@xxxxxxx/

>
> Now, I seem to remember we had this discussion in the past some time,
> and you had some hacks available.

That, I believe, was for the NPS boundaries that we don't expose in NPS1,
but the CCX boundary should be good enough.

>
> Anyway, the whole premise was to have one leaf/cacheline per cache, such
> that high frequency atomic ops set/clear bit, don't bounce the line
> around.
>
> I took the nohz bitmap, because it was relatively simple and is known to
> suffer from contention under certain workloads.

Ack! It would be better to tie it to TOPO_TILE_DOMAIN then, which
maps to the "CCX" on AMD and is the LLC boundary. A CCD is just a
cluster of nearby CCXs; it is mostly the dense-core offerings that
enumerate 2 CCXs per CCD.
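
To make the "one leaf/cacheline per cache" idea concrete, here is a small untested userspace sketch using C11 atomics. Everything in it is an assumption for illustration: the 64-byte cacheline, 16 CPUs per LLC, contiguous CPU numbering within an LLC, and the names (struct llc_leaf_sketch, llc_leaf_set_cpu(), and so on). A real implementation would derive the CPU to leaf mapping from the topology code instead of a division.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stdint.h>

#define CACHELINE_BYTES	64	/* assumption: 64-byte cachelines */
#define CPUS_PER_LLC	16u	/* hypothetical: CPUs sharing one LLC */
#define NR_LLCS		4u	/* hypothetical: number of LLCs */

/* One leaf per LLC, each padded out to its own cacheline. */
struct llc_leaf_sketch {
	alignas(CACHELINE_BYTES) _Atomic uint64_t bitmap;
};

static struct llc_leaf_sketch llc_leaves[NR_LLCS];

/* Assumes CPUs of one LLC are numbered contiguously. */
static struct llc_leaf_sketch *llc_leaf_of(unsigned int cpu)
{
	assert(cpu < NR_LLCS * CPUS_PER_LLC);
	return &llc_leaves[cpu / CPUS_PER_LLC];
}

/* Atomic set: only dirties the cacheline of the CPU's own LLC leaf. */
static void llc_leaf_set_cpu(unsigned int cpu)
{
	atomic_fetch_or_explicit(&llc_leaf_of(cpu)->bitmap,
				 UINT64_C(1) << (cpu % CPUS_PER_LLC),
				 memory_order_relaxed);
}

/* Atomic clear, again confined to the CPU's own leaf. */
static void llc_leaf_clear_cpu(unsigned int cpu)
{
	atomic_fetch_and_explicit(&llc_leaf_of(cpu)->bitmap,
				  ~(UINT64_C(1) << (cpu % CPUS_PER_LLC)),
				  memory_order_relaxed);
}

static int llc_leaf_test_cpu(unsigned int cpu)
{
	uint64_t val = atomic_load_explicit(&llc_leaf_of(cpu)->bitmap,
					    memory_order_relaxed);

	return (int)((val >> (cpu % CPUS_PER_LLC)) & 1u);
}
```

With this layout, set/clear traffic from CPUs in different LLCs never touches the same line, which is the property the leaf split is after.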

--
Thanks and Regards,
Prateek