RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention

From: Deng, Pan

Date: Tue Mar 24 2026 - 05:41:49 EST


> On Mon, Jul 21, 2025 at 02:10:23PM +0800, Pan Deng wrote:
> > When running a multi-instance FFmpeg workload on an HCC system,
> > significant cache line contention is observed around `cpupri_vec->count`
> > and `mask` in struct root_domain.
> >
> > The SUT is a 2-socket machine with 240 physical cores and 480 logical
> > CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> > (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> > with FIFO scheduling. FPS is used as score.
> >
> > perf c2c tool reveals:
> > root_domain cache line 3:
> > - `cpupri->pri_to_cpu[0].count` (offset 0x38) is heavily loaded/stored
> > and contends with other fields, since counts[0] is updated more
> > frequently than the others whenever an RT task is enqueued to an empty
> > runqueue or dequeued from a non-overloaded one.
> > - cycles per load: ~10K to 59K
> >
> > cpupri's last cache line:
> > - `cpupri_vec->count` and `mask` contend. The transcoding threads use
> > RT priority 99, so the contention lands at the end of the structure.
> > - cycles per load: ~1.5K to 10.5K
> >
> > This change mitigates the `cpupri_vec->count` and `mask` contention by
> > placing each count and mask on separate cache lines.
>
> Right.
>
> > Note: The side effect of this change is that struct cpupri size is
> > increased from 26 cache lines to 203 cache lines.
>
> That is pretty horrible, but probably unavoidable.
>
> > An alternative implementation of this patch could be separating `counts`
> > and `masks` into 2 vectors in cpupri_vec (counts[] and masks[]), and
> > adding two paddings:
> > 1. Between counts[0] and counts[1], since counts[0] is more frequently
> > updated than others.
>
> That is completely workload specific; it is a direct consequence of your
> (probably busted) priority assignment scheme.
>
> > 2. Between the two vectors, since counts[] is accessed read-write while
> > masks[] is mostly read-only, as it stores pointers.
> >
> > The alternative introduces the complexity of a 31+/21- LoC change and
> > achieves almost the same performance, while struct cpupri shrinks from
> > 26 cache lines to 21 cache lines.
>
> That is not an alternative, since it very specifically only deals with
> fifo-99 contention.
>
> > ---
> > kernel/sched/cpupri.h | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
> > index d6cba0020064..245b0fa626be 100644
> > --- a/kernel/sched/cpupri.h
> > +++ b/kernel/sched/cpupri.h
> > @@ -9,7 +9,7 @@
> >
> > struct cpupri_vec {
> > atomic_t count;
> > - cpumask_var_t mask;
> > + cpumask_var_t mask ____cacheline_aligned;
> > };
>
> At the very least this needs a comment, explaining the what and how of
> it.

Hi Peter,

Thank you very much for taking the time to look at this patch series.

Before digging into the details, let me briefly describe the structure of
the patch set. Each patch builds incrementally on the previous ones:
patch 1 improves performance by 11%, patches 1+2 by 12%, patches 1+2+3
by 13%, and patches 1+2+3+4 by 16%.

Since patch 1 gives the most benefit and is simple enough, we plan to
address the first issue in patch 1 and push that patch first, then
address your comments in the remaining patches. We will investigate a
more generic method to solve the global contention issue, as you
proposed for patch 3 and patch 4, and we plan to evaluate it on
multi-LLC systems as well (Intel and AMD).
Regarding this patch: yes, using the cacheline-aligned attribute can
increase memory usage.
After internal discussion, we are considering an alternative that
mitigates the memory waste: use kmalloc() to allocate the count in a
separate memory area rather than placing the count and the cpumask
together in the structure. The rationale is that writes through the
counter pointer and reads of the cpumask pointer are then isolated in
different memory areas, which reduces false sharing; in addition, the
slab/slub allocator behind kmalloc() can place the objects on different
cache lines, further reducing cache contention. The drawback of
dynamically allocating the counters is that we have to manage their
life cycle.
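A userspace sketch of that alternative, with aligned_alloc() standing in
for kmalloc() and all names purely illustrative, might look like this:

```c
#include <stdatomic.h>
#include <stdlib.h>

#define CACHELINE 64 /* assumed cache line size */

/* Illustrative stand-in for cpupri_vec: the counter lives in its own
 * allocation, so count writes and mask reads touch unrelated lines. */
struct vec_sketch {
	atomic_int *count;   /* separately allocated */
	unsigned long *mask; /* stays inline with the struct */
};

static int vec_sketch_init(struct vec_sketch *vec)
{
	/* one full cache line per counter; aligned_alloc() models what a
	 * slab cache with cacheline-aligned objects would provide */
	vec->count = aligned_alloc(CACHELINE, CACHELINE);
	if (!vec->count)
		return -1;
	atomic_init(vec->count, 0);
	vec->mask = NULL;
	return 0;
}

static void vec_sketch_destroy(struct vec_sketch *vec)
{
	/* the cost of the scheme: an explicit life cycle to manage */
	free(vec->count);
	vec->count = NULL;
}
```

The struct itself stays small (two pointers), and the counters' placement
is delegated to the allocator, at the price of the init/destroy pairing
shown above.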
Could you please advise whether sticking with the current
____cacheline_aligned approach or switching to kmalloc() is preferred?