Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: Chen, Yu C

Date: Fri Apr 10 2026 - 01:52:20 EST


Hi Prateek, Tim,

On 4/10/2026 7:09 AM, Tim Chen wrote:
On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
Hello Chenyu, Tim,

On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
Hello Chenyu,

On 4/8/2026 5:05 PM, Chen, Yu C wrote:
We haven't tried breaking it down further. One possible approach
is to partition it at L2 scope, the benefit of which may depend on
the workload.

I fear at that point we'll have too many cachelines and too much
cache pollution when the CPU starts reading this at tick to schedule
a newidle balance.

A 128 core system would bring in 128 * 64B = 8kB worth of data to
traverse the mask and at that point it becomes a trade off between
how fast you want reads vs writes and does it even speed up writes
after a certain point?

Sorry I got distracted by some other stuff today but I'll share the
results from my experiments tomorrow.

Here is some data from an experiment I ran on a 3rd Generation EPYC
system (2 socket x 64C/128T (8 LLCs per socket)):

Experiment: Two threads pinned per-CPU on all CPUs, yielding to each
other and operating on some cpumask - one setting the current CPU's bit
in the mask and the other clearing it. This approximates the worst case
scenario where we have to do one modification per sched-switch.

I'm measuring total cycles taken for cpumask operations with the
following variants:

%cycles vs global mask operation

global mask : 100.0000% (var: 3.28%)
per-NUMA mask : 32.9209% (var: 7.77%)
per-LLC mask : 1.2977% (var: 4.85%)
per-LLC mask (u8 operation; no LOCK prefix) : 0.4930% (var: 0.83%)

The per-NUMA split is 3x faster, and the per-LLC split on this 16-LLC
machine is 77x faster. Since there is enough space in the cacheline, we
can use a u8 to set and clear the CPU atomically without a LOCK prefix
and then do a >> 3 to get the CPU index from the set bit, which is 202x
faster.

If we use the u8 operations, we can only read 8 CPUs per 8-byte load on
a 64-bit system, whereas with a per-LLC bitmask we can scan all 16 CPUs
on the LLC with one 8-byte read, and the per-NUMA one requires two
8-byte reads to scan the 128 CPUs per socket.

I think the per-LLC mask (or, as Tim suggested, 64 CPUs per cacheline)
is a good tradeoff between the speedup and the number of loads required
to piece together the full cpumask. Thoughts?

Yes, making it per LLC should work well enough (for balancing) to
achieve optimal benefit. Let me run some similar tests to yours, plus
hackbench/schbench, to see what the results are.
BTW, on AMD systems, does the TILE domain always match the CCX where
L3 is shared? On Intel the DIE is not always mapped to a domain
where L3 is shared.


I agree that the per-LLC mask is a good compromise between minimizing
loads and offering good speedups. For Intel, I think we should get the
LLC APIC ID mask from the 0x4 leaf (L1, L2, L3) instead of inferring it
from the 0x1f leaf (Tile, Die, etc.). The corresponding cache leaf for
AMD is, I think, 0x8000_001D. Those are parsed in the cacheinfo code and
we can get it from there.


Yes, let me check how we can leverage the L3 id for that.

thanks,
Chenyu