Re: [RFC PATCH 0/4] Optimize rss_stat initialization/teardown for single-threaded tasks
From: Mathieu Desnoyers
Date: Mon Dec 01 2025 - 09:48:37 EST
On 2025-12-01 06:31, Mateusz Guzik wrote:
On Mon, Dec 1, 2025 at 11:39 AM Harry Yoo <harry.yoo@xxxxxxxxxx> wrote:
Apologies for not reposting it for a while. I have limited capacity to push
this forward right now, but FYI... I just pushed slab-destructor-rfc-v2r2-wip
branch after rebasing it onto the latest slab/for-next.
https://gitlab.com/hyeyoo/linux/-/commits/slab-destructor-rfc-v2r2-wip?ref_type=heads
nice, thanks. This takes care of the majority of the needful(tm).
To reiterate, should something like this land, it is going to address
the multicore scalability concern for single-threaded processes better
than the patchset by Gabriel thanks to also taking care of cid. Bonus
points for handling creation and teardown of multi-threaded processes.
However, this is still going to suffer from doing a full cpu walk on
process exit. As I described earlier, the current handling can be
massively depessimized by reimplementing it to handle all 4
counters in each iteration, instead of walking everything 4 times.
This is still going to be slower than not doing the walk at all, but
it may be fast enough that Gabriel's patchset is no longer
justifiable.
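The combined walk can be sketched in userspace roughly as follows.
NR_CPUS and pcpu_count are stand-ins for nr_cpu_ids and the kernel's
per-cpu counter storage; this is a model of the idea, not kernel code:

```c
#include <assert.h>

#define NR_MM_COUNTERS 4	/* file/anon/swap/shmem, as in the kernel */
#define NR_CPUS 8		/* stand-in for nr_cpu_ids */

/* stand-in for the per-cpu counter storage */
static long pcpu_count[NR_CPUS][NR_MM_COUNTERS];

/*
 * One pass over all cpus, folding every counter into the same
 * iteration: roughly one remote cacheline touch per cpu, instead
 * of four separate possible-cpus walks.
 */
static void sum_rss_counters(long sums[NR_MM_COUNTERS])
{
	for (int i = 0; i < NR_MM_COUNTERS; i++)
		sums[i] = 0;
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		for (int i = 0; i < NR_MM_COUNTERS; i++)
			sums[i] += pcpu_count[cpu][i];
}
```

The win comes from touching each remote cpu's cachelines once rather
than once per counter.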
But then the test box is "only" 256 hw threads, what about bigger boxes?
Given my previous note about increased use of multithreading in
userspace, the more concerned you happen to be about such a walk, the
more you want an actual solution which takes care of multithreaded
processes.
Additionally one has to assume per-cpu memory will be useful for other
facilities down the line, making such a walk into an even bigger
problem.
Thus ultimately *some* tracking of whether given mm was ever active on
a given cpu is needed, preferably cheaply implemented at least for the
context switch code. Per what I described in another e-mail, one way
to do it would be to coalesce it with tlb handling by changing how the
bitmap tracking is handled -- having 2 adjacent bits per cpu denote
cpu usage and tlb state separately. For the common case, setting the
two bits should cost almost the same as setting one. Iteration for
tlb shootdowns would be less efficient, but that's probably
tolerable. Maybe there is a better way, I did not
put much thought into it. I just claim sooner or later this will need
to get solved. At the same time would be a bummer to add stopgaps
without even trying.
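The 2-bits-per-cpu encoding might look like the sketch below. All
names are hypothetical, and a single word only models word-size/2
cpus; the real thing would use the kernel's multi-word bitmap API:

```c
#include <assert.h>

/*
 * Hypothetical encoding: two adjacent bits per cpu in one bitmap.
 * Bit 2*cpu records "mm was ever active on this cpu"; bit 2*cpu+1
 * plays the role of the existing tlb bit.
 */
#define MM_ACTIVE_BIT(cpu)	(2 * (cpu))
#define MM_TLB_BIT(cpu)		(2 * (cpu) + 1)

static inline void mm_mark_active(unsigned long *bm, int cpu)
{
	/* common case: both bits set with a single read-modify-write */
	*bm |= 3UL << MM_ACTIVE_BIT(cpu);
}

static inline int mm_was_active(unsigned long bm, int cpu)
{
	return (bm >> MM_ACTIVE_BIT(cpu)) & 1;
}

static inline int mm_tlb_pending(unsigned long bm, int cpu)
{
	return (bm >> MM_TLB_BIT(cpu)) & 1;
}
```

Because the two bits share a word, the context-switch path pays one
RMW either way; only tlb-shootdown iteration has to skip every other
bit.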
With the cpu tracking problem solved, check_mm would visit only a few
cpus in the benchmark (probably just 1), and it would be faster
single-threaded
than the proposed patch *and* would retain that for processes which
went multithreaded.
Looking at this problem, it appears to be a good fit for rseq mm_cid
(per-mm concurrency ids). Let me explain.
I originally implemented the rseq mm_cid for userspace. It keeps track
of max_mm_cid = min(nr_threads, nr_allowed_cpus) for each mm, and lets
the scheduler select a current mm_cid value within the range
[0 .. max_mm_cid - 1]. With Thomas Gleixner's rewrite (currently in
tip), we even have hooks in thread clone/exit where we know when
max_mm_cid is increased/decreased for a mm. So we could keep track of
the maximum value of max_mm_cid over the lifetime of a mm.
So using mm_cid for per-mm rss counter would involve:
- Still allocating memory per-cpu on mm allocation (nr_cpu_ids), but
without zeroing all that memory (we eliminate a possible-cpus walk on
allocation).
- Initialize CPU counters on thread clone when max_mm_cid is increased.
Keep track of the max value of max_mm_cid over mm lifetime.
- Rather than using the per-cpu accessors to access the counters, we
would have to load the per-task mm_cid field to get the counter index.
This would add slight overhead on the fast path, because we would
replace a segment-selector-prefixed operation with an access that
depends on a load of the current task's mm_cid index.
- Iteration on all possible cpus at process exit is replaced by an
iteration on mm maximum max_mm_cid, which will be bound by
the maximum value of min(nr_threads, nr_allowed_cpus) over the
mm lifetime. This iteration should be done with the new mm_cid
mutex held across thread clone/exit.
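To make the trade-off concrete, here is a rough userspace sketch of
the mm_cid-indexed scheme. All structure and function names are
illustrative, not the real kernel API:

```c
#include <assert.h>
#include <string.h>

#define NR_MM_COUNTERS 4

struct mm_counters {
	long count[NR_MM_COUNTERS];
};

struct mm_sketch {
	struct mm_counters *percid;	/* nr_cpu_ids slots, not zeroed at alloc */
	int max_mm_cid_ever;		/* lifetime max of min(nr_threads, nr_allowed_cpus) */
};

struct task_sketch {
	int mm_cid;			/* concurrency id assigned by the scheduler */
	struct mm_sketch *mm;
};

/*
 * Fast path: instead of a segment-prefixed this_cpu_add(), index by
 * the current task's mm_cid -- one extra dependent load.
 */
static void add_mm_counter(struct task_sketch *t, int member, long v)
{
	t->mm->percid[t->mm_cid].count[member] += v;
}

/*
 * Exit path: the walk is bounded by the lifetime max of max_mm_cid,
 * not nr_cpu_ids (kept stable by holding the mm_cid mutex across
 * thread clone/exit in the real implementation).
 */
static long sum_mm_counter(struct mm_sketch *mm, int member)
{
	long sum = 0;
	for (int cid = 0; cid < mm->max_mm_cid_ever; cid++)
		sum += mm->percid[cid].count[member];
	return sum;
}
```

For a single-threaded process max_mm_cid_ever stays 1, so the exit
walk degenerates to a single load.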
One more downside to consider is loss of NUMA locality, because the
index used to access the per-cpu memory would not take into account
the hardware topology. The index to topology should stay stable for
a given mm, but if we mix the per-cpu allocations of different mms,
then NUMA locality would be degraded.
Ideally we'd have a per-cpu allocator with per-mm arenas for mm_cid
indexing if we care about NUMA locality.
So let's say you have a 256-core machine, where cpu numbers can go
from 0 to 255, with a 4-thread process, mm_cid will be limited to
the range [0..3]. Likewise, if a process has tons of threads but is
limited to a few cores (e.g. pinned to cores 10 to 19), the range
will be limited to [0..9].
This approach solves the runtime overhead issue of zeroing per-cpu
memory for all scenarios:
* single-threaded: index = 0
* nr_threads < nr_cpu_ids:
  * nr_threads < nr_allowed_cpus: index = [0 .. nr_threads - 1]
  * nr_threads >= nr_allowed_cpus: index = [0 .. nr_allowed_cpus - 1]
* nr_threads >= nr_cpu_ids: index = [0 .. nr_cpu_ids - 1]
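Every case above reduces to the same bound. A trivial helper
(illustrative name, and relying on nr_allowed_cpus <= nr_cpu_ids by
construction) makes that explicit:

```c
#include <assert.h>

/* Bound on the mm_cid index range: min(nr_threads, nr_allowed_cpus). */
static int mm_cid_bound(int nr_threads, int nr_allowed_cpus)
{
	return nr_threads < nr_allowed_cpus ? nr_threads : nr_allowed_cpus;
}
```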
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com