RE: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention

From: Deng, Pan

Date: Thu Apr 09 2026 - 07:48:13 EST

> According to your test results above, this original proposal seems
> simple enough. It provides a general benefit, not only for FFmpeg workloads
> with "unusual" CPU affinity settings, but also for other common workloads
> that do not use CPU affinity or partitioning.

Yes, exactly. FFmpeg and K8s are just example scenarios - the optimization
benefits any workload with RT thread contention. For instance, running
cyclictest on a 2-socket, 384-logical-core system:

"cyclictest -t -i200 -h 32 -m -p 95 -q"

This patch reduces both mean and max latency by at least 40%.

> I still prefer this proposal. Later we can rebase patch 4 on top of sbm
> to see if it brings further improvements. patch 1 and patch 4 could form a
> patch series IMHO.

Thank you for the feedback. I agree that patch 1 and patch 4 work well
together. Regarding the sbm discussion: we've observed promising results
in our sbm experiments, and I expect that rebasing patch 4 on top of sbm
will show further improvements beyond the per-NUMA implementation. I'll
try this once the sbm implementation stabilizes.

Per Peter's previous request, I plan to add a comment like this:
/*
 * Place mask on a separate cacheline to mitigate contention
 * between count (read-write) and mask (read-mostly when it holds
 * a pointer). This alignment increases root_domain size by ~11KB,
 * but eliminates cache line bouncing between cpupri_set() writers
 * and cpupri_find_fitness() readers under heavy RT workloads.
 *
 * Memory overhead considerations:
 * - Systems with cpuset partitions: each partition's root_domain is
 *   dynamically allocated (kzalloc). The ~11KB overhead per partition
 *   scales with partition count, which is acceptable on servers
 *   using partitions.
 * - Systems without partitions: only the static def_root_domain incurs
 *   the overhead, which is manageable for typical use.
 *
 * Additionally, this cacheline alignment ensures cpupri starts at a
 * cacheline boundary, eliminating false sharing with root_domain's
 * preceding fields (rto_mask, rto_loop_next, rto_loop_start).
 */
cpumask_var_t mask ____cacheline_aligned_in_smp;
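
As a rough sanity check of the ~11KB figure, the layout change can be
modelled in userspace. This is only a sketch, not the kernel code: the
struct names and extra_bytes() are hypothetical, the 64-byte cache line
and the 101 priority vectors are assumptions, and mask is modelled as a
plain pointer (the off-stack cpumask case).

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE     64   /* assumption: 64-byte cache lines */
#define NR_PRIORITIES 101  /* assumption: one vector per cpupri level */

/* Hypothetical stand-in for the current layout: count and mask share a line. */
struct vec_packed {
	int   count;  /* stands in for atomic_t */
	void *mask;   /* stands in for an off-stack cpumask_var_t (a pointer) */
};

/* Hypothetical stand-in for the patched layout: mask starts a new line. */
struct vec_aligned {
	int   count;
	void *mask __attribute__((aligned(CACHELINE)));
};

/* Aligning a member also aligns, and pads, the enclosing struct. */
_Static_assert(offsetof(struct vec_aligned, mask) % CACHELINE == 0,
	       "mask must start on a cache line boundary");
_Static_assert(sizeof(struct vec_aligned) % CACHELINE == 0,
	       "struct size must be a multiple of the cache line size");

/* Extra bytes across all priority vectors compared with the packed layout. */
static size_t extra_bytes(void)
{
	return NR_PRIORITIES *
	       (sizeof(struct vec_aligned) - sizeof(struct vec_packed));
}
```

On an LP64 build with GCC or Clang I would expect the two structs to be
16 and 128 bytes, so extra_bytes() comes to 101 * 112 = 11312 bytes,
which is consistent with the ~11KB quoted in the comment.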

Since this optimization is independent of the sbm work, would it be possible
to review this patch first? That would allow the sbm-related improvements
(patch 4) to build on top of this foundation once they're ready.

Best Regards
Pan