Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention


From: Chen, Yu C

Date: Wed Apr 08 2026 - 06:17:22 EST

On 7/21/2025 2:10 PM, Pan Deng wrote:

> When running a multi-instance FFmpeg workload on an HCC system, significant
> cache line contention is observed around `cpupri_vec->count` and `mask` in
> struct root_domain.
>
> The SUT is a 2-socket machine with 240 physical cores and 480 logical
> CPUs. 60 FFmpeg instances are launched, each pinned to 4 physical cores
> (8 logical CPUs) for transcoding tasks. Sub-threads use RT priority 99
> with FIFO scheduling. FPS is used as score.

[ ... ]

> As a result:
> - FPS improves by ~11%
> - Kernel cycles% drops from ~20% to ~11%
> - `count` and `mask` related cache line contention is mitigated, perf c2c
>   shows root_domain cache line 3 `cycles per load` drops from ~10K-59K
>   to ~0.5K-8K, cpupri's last cache line no longer appears in the report.
> - stress-ng cyclic benchmark is improved ~31.4%, command:
>   stress-ng/stress-ng --cyclic $(nproc) --cyclic-policy fifo \
>                       --timeout 30 --minimize --metrics
> - rt-tests/pi_stress is improved ~76.5%, command:
>   rt-tests/pi_stress -D 30 -g $(($(nproc) / 2))

Judging from your test results above, this original proposal is simple
enough and provides a general benefit: not only for FFmpeg workloads
with "unusual" CPU affinity settings, but also for other common workloads
that do not use CPU affinity or partitioning.

I still prefer this proposal. Later we can rebase patch 4 on top of sbm
to see if it brings further improvements. IMHO, patch 1 and patch 4 could
form a patch series.

thanks,
Chenyu

diff --git a/kernel/sched/cpupri.h b/kernel/sched/cpupri.h
index d6cba0020064..245b0fa626be 100644
--- a/kernel/sched/cpupri.h
+++ b/kernel/sched/cpupri.h
@@ -9,7 +9,7 @@
 struct cpupri_vec {
 	atomic_t		count;
-	cpumask_var_t		mask;
+	cpumask_var_t		mask ____cacheline_aligned;
 };
 struct cpupri {
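
To illustrate what the hunk above does to the layout, here is a minimal
userspace sketch, not kernel code. It assumes 64-byte cache lines and models
cpumask_var_t as a pointer (the CONFIG_CPUMASK_OFFSTACK=y case). The names
vec_packed, vec_aligned, cpumask_ptr_t and report_layout are made up for this
example, and atomic_t is a stand-in type. The aligned attribute is used the
same way ____cacheline_aligned uses it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Assumption: 64-byte cache lines, as on typical x86-64 parts. */
#define CACHE_LINE 64

/* Userspace stand-ins for the kernel types (layout illustration only). */
typedef struct { int counter; } atomic_t;
typedef unsigned long *cpumask_ptr_t;	/* models an off-stack cpumask_var_t */

/* Before the patch: count and the mask pointer share one cache line. */
struct vec_packed {
	atomic_t	count;
	cpumask_ptr_t	mask;
};

/* After the patch: mask is pushed to the start of its own cache line. */
struct vec_aligned {
	atomic_t	count;
	cpumask_ptr_t	mask __attribute__((aligned(CACHE_LINE)));
};

/* The aligned member must begin on a cache line boundary. */
static_assert(offsetof(struct vec_aligned, mask) % CACHE_LINE == 0,
	      "mask should start on a cache line boundary");

/* Print size and member offset for both layouts. */
static void report_layout(void)
{
	printf("packed:  sizeof=%zu offsetof(mask)=%zu\n",
	       sizeof(struct vec_packed), offsetof(struct vec_packed, mask));
	printf("aligned: sizeof=%zu offsetof(mask)=%zu\n",
	       sizeof(struct vec_aligned), offsetof(struct vec_aligned, mask));
}
```

Calling report_layout() from a main() shows the trade-off: in the aligned
variant, writes to count no longer touch the line holding the mask pointer,
at the price of each vector growing to a multiple of the cache line size.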