Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
From: Mathieu Desnoyers
Date: Fri Nov 28 2025 - 08:30:15 EST
On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> The cost of the pcpu memory allocation is non-negligible for systems
> with many cpus, and it is quite visible when forking a new task, as
> reported in a few occasions.
I've come to the same conclusion within the development of
the hierarchical per-cpu counters.
But while the mm_struct has a SLAB cache (initialized in
kernel/fork.c:mm_cache_init()), there is no such thing
for the per-mm per-cpu data.
In the mm_struct, we have the following per-cpu data (please
let me know if I missed any in the maze):
- struct mm_cid __percpu *pcpu_cid (or equivalent through
struct mm_mm_cid after Thomas Gleixner gets his rewrite
upstream),
- unsigned int __percpu *futex_ref,
- NR_MM_COUNTERS rss_stats per-cpu counters.
What would really reduce memory allocation overhead on fork
is to move all those fields into a top level
"struct mm_percpu_struct" as a first step. This would
merge 3 per-cpu allocations into one when forking a new
task.
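To make the first step concrete, here is a rough userspace model, not kernel code. model_alloc_percpu() is a stand-in for alloc_percpu(); the struct layouts, the mm_split/mm_merged names, and the NR_MM_COUNTERS value are all assumptions made for illustration. It only counts how many per-cpu allocations each layout needs at init time.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_MM_COUNTERS 4	/* illustrative value for this sketch */

/* Hypothetical stand-in for the real per-cpu cid type. */
struct mm_cid { int cid; };

static unsigned long nr_percpu_allocs;

/* Userspace stand-in for alloc_percpu(): one zeroed slot per CPU. */
static void *model_alloc_percpu(size_t size, unsigned int nr_cpus)
{
	assert(nr_cpus > 0);
	nr_percpu_allocs++;
	return calloc(nr_cpus, size);
}

/* Model of today's layout: three separate per-cpu areas. */
struct mm_split {
	struct mm_cid *pcpu_cid;
	unsigned int *futex_ref;
	int *rss_counters;	/* NR_MM_COUNTERS ints per CPU */
};

/* Hypothetical merged layout: one instance per CPU. */
struct mm_percpu_struct {
	struct mm_cid cid;
	unsigned int futex_ref;
	int rss_stat[NR_MM_COUNTERS];
};

struct mm_merged {
	struct mm_percpu_struct *pcpu;
};

static int mm_split_init(struct mm_split *mm, unsigned int nr_cpus)
{
	mm->pcpu_cid = model_alloc_percpu(sizeof(*mm->pcpu_cid), nr_cpus);
	mm->futex_ref = model_alloc_percpu(sizeof(*mm->futex_ref), nr_cpus);
	mm->rss_counters = model_alloc_percpu(sizeof(int) * NR_MM_COUNTERS,
					      nr_cpus);
	if (!mm->pcpu_cid || !mm->futex_ref || !mm->rss_counters) {
		/* free(NULL) is a no-op, so partial failure is handled. */
		free(mm->pcpu_cid);
		free(mm->futex_ref);
		free(mm->rss_counters);
		return -1;
	}
	return 0;
}

static int mm_merged_init(struct mm_merged *mm, unsigned int nr_cpus)
{
	mm->pcpu = model_alloc_percpu(sizeof(*mm->pcpu), nr_cpus);
	return mm->pcpu ? 0 : -1;
}
```

In this model, mm_split_init() goes through the allocator three times and mm_merged_init() once, and the merged layout also has a single error path to unwind.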
Then the second step is to create a mm_percpu_struct
cache to bypass the per-cpu allocator.
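The second step could look roughly like the following free-list model, again in userspace. Everything here is hypothetical: struct mm_percpu_cache, cache_alloc() and cache_free() are names invented for this sketch, and calloc() stands in for the slow path through the per-cpu allocator. It only illustrates that a released block can be handed to the next fork without going back to the allocator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A free block stores the free-list link in its own memory. */
struct pcpu_block {
	struct pcpu_block *next;
};

/* Hypothetical cache of whole per-cpu areas (one block covers all CPUs). */
struct mm_percpu_cache {
	struct pcpu_block *free;
	size_t block_size;
	unsigned long slow_allocs;	/* trips to the underlying allocator */
	unsigned long cache_hits;	/* allocations served from the list */
};

static void cache_init(struct mm_percpu_cache *c, size_t obj_size,
		       unsigned int nr_cpus)
{
	size_t sz = obj_size * nr_cpus;

	assert(nr_cpus > 0);
	c->free = NULL;
	/* A block must be large enough to hold the free-list link. */
	c->block_size = sz < sizeof(struct pcpu_block) ?
			sizeof(struct pcpu_block) : sz;
	c->slow_allocs = 0;
	c->cache_hits = 0;
}

static void *cache_alloc(struct mm_percpu_cache *c)
{
	struct pcpu_block *b = c->free;

	if (b) {
		/* Fast path: reuse a cached block, zeroed for the new mm. */
		c->free = b->next;
		c->cache_hits++;
		memset(b, 0, c->block_size);
		return b;
	}
	/* Slow path: stand-in for the per-cpu allocator. */
	c->slow_allocs++;
	return calloc(1, c->block_size);
}

static void cache_free(struct mm_percpu_cache *c, void *p)
{
	struct pcpu_block *b = p;

	if (!p)
		return;
	b->next = c->free;
	c->free = b;
}
```

Note that this model still zeroes the whole block on reuse, so it removes the allocator call but not the per-CPU initialization cost.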
I suspect that by doing just that we'd get most of the
performance benefits provided by the single-threaded special-case
proposed here.
I'm not against special-casing single-threaded tasks if it's still
worth it after doing the underlying data structure layout/caching
changes I'm proposing here, but I think we need to fix the
memory allocation overhead issue first, before working around it
with special cases and added complexity.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com