Re: [PATCH v2 4/4] sched/rt: Split cpupri_vec->cpumask to per NUMA node to reduce contention

From: Tim Chen

Date: Thu Apr 09 2026 - 19:10:06 EST


On Thu, 2026-04-09 at 10:47 +0530, K Prateek Nayak wrote:
> Hello Chenyu, Tim,
>
> On 4/8/2026 9:22 PM, K Prateek Nayak wrote:
> > Hello Chenyu,
> >
> > On 4/8/2026 5:05 PM, Chen, Yu C wrote:
> > > We haven't tried breaking it down further. One possible approach
> > > is to partition it at L2 scope, the benefit of which may depend on
> > > the workload.
> >
> > I fear at that point we'll have too many cachelines and too much
> > cache pollution when the CPU starts reading this at tick to schedule
> > a newidle balance.
> >
> > A 128-core system would bring in 128 * 64B = 8kB worth of data to
> > traverse the mask, and at that point it becomes a tradeoff between
> > how fast you want reads vs. writes - and does it even speed up
> > writes after a certain point?
> >
> > Sorry I got distracted by some other stuff today but I'll share the
> > results from my experiments tomorrow.
>
> Here is some data from an experiment I ran on a 3rd Generation EPYC
> system (2 socket x 64C/128T (8LLCs per socket)):
>
> Experiment: Two threads pinned per CPU on all CPUs, yielding to each
> other and operating on a cpumask - one sets the current CPU in the
> mask and the other clears it. This approximates the worst case of one
> mask modification per sched-switch.
>
> I'm measuring the total cycles taken for the cpumask operations with
> the following variants:
>
> %cycles vs global mask operation
>
> global mask : 100.0000% (var: 3.28%)
> per-NUMA mask : 32.9209% (var: 7.77%)
> per-LLC mask : 1.2977% (var: 4.85%)
> per-LLC mask (u8 operation; no LOCK prefix) : 0.4930% (var: 0.83%)
>
> The per-NUMA split is 3x faster, and per-LLC on this 16-LLC machine is
> 77x faster. Since there is enough space in the cacheline, we can use a
> u8 to set and clear the CPU atomically without a LOCK prefix and then
> do a >> 3 to get the CPU index from the set bit, which is 202x faster.
>
> If we use the u8 operations, we can only read 8 CPUs per 8-byte load
> on a 64-bit system, but with a per-LLC bitmask we can scan all 16 CPUs
> of the LLC with one 8-byte read, while the per-NUMA one requires two
> 8-byte reads to scan the 128 CPUs per socket.
>
> I think per-LLC mask (or, as Tim suggested, 64CPUs per cacheline) is
> a good tradeoff between the speedup vs amount of loads required to
> piece together the full cpumask. Thoughts?

I agree that a per-LLC mask is a good compromise between minimizing
loads and offering good speedups. For Intel, I think we should get the
LLC APIC ID mask from CPUID leaf 0x4 (L1, L2, L3) instead of inferring
it from leaf 0x1F (Tile, Die, etc.). For AMD, I believe the
corresponding cache leaf is 0x8000_001D. Those are parsed in the
cacheinfo code and we can get the mask from there.
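
If it helps, the relevant EAX fields of those leaves decode like this
(a sketch from the documented layout of the deterministic cache
parameters leaf - EAX[4:0] is the cache type, EAX[7:5] the level, and
EAX[25:14] holds "max logical CPUs sharing this cache" minus one; the
helper names are mine, and in-kernel the parsed result should already
be reachable through cacheinfo rather than re-decoding CPUID):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the EAX layout shared by CPUID leaf 0x4 (Intel) and
 * leaf 0x8000_001D (AMD) for deterministic cache parameters. */
static unsigned int cache_type(uint32_t eax)
{
	return eax & 0x1f;               /* 0=null 1=data 2=insn 3=unified */
}

static unsigned int cache_level(uint32_t eax)
{
	return (eax >> 5) & 0x7;         /* 1 = L1, 2 = L2, 3 = L3 */
}

static unsigned int sharing_cpus(uint32_t eax)
{
	return ((eax >> 14) & 0xfff) + 1; /* logical CPUs sharing this cache */
}
```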

Tim