Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks

From: Mathieu Desnoyers

Date: Fri Nov 28 2025 - 08:30:15 EST


On 2025-11-27 18:36, Gabriel Krisman Bertazi wrote:
> The cost of the pcpu memory allocation is non-negligible for systems
> with many cpus, and it is quite visible when forking a new task, as
> reported in a few occasions.

I came to the same conclusion while developing the hierarchical
per-cpu counters.

But while the mm_struct has a SLAB cache (initialized in
kernel/fork.c:mm_cache_init()), there is no such thing
for the per-mm per-cpu data.

In the mm_struct, we have the following per-cpu data (please
let me know if I missed any in the maze):

- struct mm_cid __percpu *pcpu_cid (or equivalent through
struct mm_mm_cid after Thomas Gleixner gets his rewrite
upstream),

- unsigned int __percpu *futex_ref,

- NR_MM_COUNTERS rss_stat per-cpu counters.

What would really reduce memory allocation overhead on fork
is to move all those fields into a top level
"struct mm_percpu_struct" as a first step. This would
merge 3 per-cpu allocations into one when forking a new
task.
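
To make the first step concrete, here is a rough userspace model of the
grouping, not kernel code. The struct name comes from the proposal above;
everything else (struct mm_model, the mm_model_*() helpers, the plain int
standing in for struct mm_cid, the plain long standing in for the per-cpu
part of each rss_stat counter, and the value of NR_MM_COUNTERS) is an
assumption made for illustration. calloc() of an array indexed by CPU
number stands in for a single per-cpu allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative value; the kernel defines this through an enum. */
#define NR_MM_COUNTERS 4

/*
 * Hypothetical per-CPU slot grouping the three per-mm per-cpu fields.
 * Field types are simplified stand-ins for the kernel types.
 */
struct mm_percpu_struct {
	int cid;                       /* stand-in for struct mm_cid */
	unsigned int futex_ref;        /* per-cpu futex reference count */
	long rss_stat[NR_MM_COUNTERS]; /* per-cpu part of the rss counters */
};

/* Hypothetical model of the part of mm_struct that owns per-cpu data. */
struct mm_model {
	unsigned int nr_cpus;
	struct mm_percpu_struct *pcpu; /* one allocation, nr_cpus slots */
};

/* One zeroed allocation replaces three separate per-cpu allocations. */
static int mm_model_init(struct mm_model *mm, unsigned int nr_cpus)
{
	mm->pcpu = calloc(nr_cpus, sizeof(*mm->pcpu));
	if (!mm->pcpu)
		return -1;
	mm->nr_cpus = nr_cpus;
	return 0;
}

/* Userspace analogue of taking a pointer to a given CPU's slot. */
static struct mm_percpu_struct *mm_model_cpu_ptr(struct mm_model *mm,
						 unsigned int cpu)
{
	assert(cpu < mm->nr_cpus);
	return &mm->pcpu[cpu];
}

/* A single free on teardown as well. */
static void mm_model_destroy(struct mm_model *mm)
{
	free(mm->pcpu);
	mm->pcpu = NULL;
	mm->nr_cpus = 0;
}
```

In this model, fork pays for one allocation and exit for one free,
whatever the number of per-cpu fields in the struct.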

Then the second step is to create a mm_percpu_struct
cache to bypass the per-cpu allocator.
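
As an illustration of what the second step buys, here is a userspace
sketch of a recycling cache. It is only a model under stated assumptions:
a LIFO free list stands in for whatever caching scheme the kernel would
actually use, malloc() stands in for the per-cpu allocator, and all names
(struct mm_pcpu_cache, the mm_pcpu_cache_*() helpers, nr_backing_allocs)
are hypothetical. There is no locking, so it is single-threaded only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* A freed block is reused as a free-list node. */
struct pcpu_block {
	struct pcpu_block *next;
};

/* Hypothetical cache of fixed-size per-mm per-cpu blocks. */
struct mm_pcpu_cache {
	size_t block_size;               /* slot size times number of CPUs */
	struct pcpu_block *free_list;    /* recycled blocks, LIFO order */
	unsigned long nr_backing_allocs; /* calls into the slow allocator */
};

static void mm_pcpu_cache_init(struct mm_pcpu_cache *c, size_t slot_size,
			       unsigned int nr_cpus)
{
	size_t size = slot_size * nr_cpus;

	/* A block must be able to hold the free-list link. */
	if (size < sizeof(struct pcpu_block))
		size = sizeof(struct pcpu_block);
	c->block_size = size;
	c->free_list = NULL;
	c->nr_backing_allocs = 0;
}

/* Fast path: pop a recycled block. Slow path: backing allocator. */
static void *mm_pcpu_cache_alloc(struct mm_pcpu_cache *c)
{
	struct pcpu_block *b = c->free_list;

	if (b) {
		c->free_list = b->next;
	} else {
		b = malloc(c->block_size);
		if (!b)
			return NULL;
		c->nr_backing_allocs++;
	}
	memset(b, 0, c->block_size); /* hand out zeroed counters */
	return b;
}

/* Keep the block for the next fork instead of releasing it. */
static void mm_pcpu_cache_free(struct mm_pcpu_cache *c, void *p)
{
	struct pcpu_block *b = p;

	b->next = c->free_list;
	c->free_list = b;
}

/* Release all cached blocks back to the backing allocator. */
static void mm_pcpu_cache_destroy(struct mm_pcpu_cache *c)
{
	while (c->free_list) {
		struct pcpu_block *b = c->free_list;

		c->free_list = b->next;
		free(b);
	}
}
```

With this model, a fork/exit/fork sequence reaches the backing allocator
only once; the second fork gets the recycled block back, zeroed.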

I suspect that by doing just that we'd get most of the
performance benefits provided by the single-threaded special-case
proposed here.

I'm not against special casing single-threaded if it's still
worth it after doing the underlying data structure layout/caching
changes I'm proposing here, but I think we need to fix the
memory allocation overhead issue first before working around it
with special cases and added complexity.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com