Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
From: Chen, Yu C
Date: Wed Apr 08 2026 - 06:17:22 EST
On 7/21/2025 2:10 PM, Pan Deng wrote:
When running a multi-instance FFmpeg workload on an HCC system, significant
cache line contention is observed around `cpupri_vec->count` and `mask` in
struct root_domain.
The SUT is a 2-socket machine with 240 physical cores and 480 logical
CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
(8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
with FIFO scheduling. FPS is used as the score.
[ ... ]
As a result:
- FPS improves by ~11%
- Kernel cycles% drops from ~20% to ~11%
- `count` and `mask` related cache line contention is mitigated. perf c2c
shows root_domain cache line 3 `cycles per load` dropping from ~10K-59K
to ~0.5K-8K, and cpupri's last cache line no longer appears in the report.
- The stress-ng cyclic benchmark improves by ~31.4%. Command:
stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
--timeout 30 --minimize --metrics
- rt-tests/pi_stress improves by ~76.5%. Command:
rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))
According to your test results above, this original proposal seems
simple enough. It provides a general benefit, not only for FFmpeg workloads
with "unusual" CPU affinity settings, but also for other common workloads
that do not use CPU affinity or partitioning.
I still prefer this proposal. Later we can rebase patch 4 on top of sbm
to see if it brings further improvements. Patch 1 and patch 4 could form a
patch series, IMHO.
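To make the layout change concrete, here is a small userspace sketch (not
kernel code) of what aligning `mask` does to the struct. It assumes 64-byte
cache lines and assumes cpumask_var_t is a pointer, as it is with
CONFIG_CPUMASK_OFFSTACK=y. All names in it (fake_atomic_t,
fake_cpumask_var_t, vec_packed, vec_aligned, cacheline_of, CACHELINE_BYTES,
cacheline_aligned) are made up for illustration and do not exist in the tree.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 64-byte cache lines. */
#define CACHELINE_BYTES 64

/* Userspace stand-in for ____cacheline_aligned (GCC/Clang attribute). */
#define cacheline_aligned __attribute__((aligned(CACHELINE_BYTES)))

/* Hypothetical stand-ins for atomic_t and a pointer-typed cpumask_var_t. */
typedef struct { int counter; } fake_atomic_t;
typedef unsigned long *fake_cpumask_var_t;

/* Layout without the alignment: count and mask sit next to each other. */
struct vec_packed {
	fake_atomic_t count;
	fake_cpumask_var_t mask;
};

/* Layout with the alignment: mask starts on a cache line of its own. */
struct vec_aligned {
	fake_atomic_t count;
	fake_cpumask_var_t mask cacheline_aligned;
};

/* Index of the cache line, relative to the struct start, holding offset. */
static inline size_t cacheline_of(size_t offset)
{
	return offset / CACHELINE_BYTES;
}

/* Compile-time check that the aligned member really moved to line 1. */
_Static_assert(offsetof(struct vec_aligned, mask) == CACHELINE_BYTES,
	       "mask should start on the second cache line");
```

In this sketch, a write to `count` in vec_aligned no longer dirties the line
that holds `mask`, while in vec_packed both fields share line 0. The cost is
size: the member alignment raises the alignment of the whole struct, so
sizeof(struct vec_aligned) is 128 bytes here instead of 16 on an LP64 build.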
thanks,
Chenyu
diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
struct cpupri_vec {
atomic_t count;
- cpumask_var_t mask;
+ cpumask_var_t mask ____cacheline_aligned;
};
struct cpupri {