Re: [PATCH] mm: use per-numa-node atomics instead of percpu_counters

From: Mateusz Guzik
Date: Wed Mar 26 2025 - 19:37:29 EST


On Tue, Mar 25, 2025 at 06:15:49PM -0400, Sweet Tea Dorminy wrote:
> From: Sweet Tea Dorminy <sweettea@xxxxxxxxxx>
>
> This was a result of f1a7941243c1 ("mm: convert mm's rss stats into
> percpu_counter") [1]. Previously, the memory error was bounded by
> 64*nr_threads pages, a very livable megabyte. Now, however, as a result of
> scheduler decisions moving the threads around the CPUs, the memory error could
> be as large as a gigabyte.
>
> This is a really tremendous inaccuracy for any few-threaded program on a
> large machine and impedes monitoring significantly. These stat counters are
> also used to make OOM killing decisions, so this additional inaccuracy could
> make a big difference in OOM situations -- either resulting in the wrong
> process being killed, or in less memory being returned from an OOM-kill than
> expected.
>
> Finally, while the change to percpu_counter does significantly improve the
> accuracy over the previous per-thread error for many-threaded services, it does
> also have performance implications - up to 12% slower for short-lived processes
> and 9% increased system time in make test workloads [2].
>
> A previous attempt to address this regression by Peng Zhang [3] used a hybrid
> approach with delayed allocation of percpu memory for rss_stats, showing
> promising improvements of 2-4% for process operations and 6.7% for page
> faults.
>
> This RFC takes a different direction by replacing percpu_counters with a
> more efficient set of per-NUMA-node atomics. The approach:
>
> - Uses one atomic per node up to a bound to reduce cross-node updates.
> - Keeps a similar batching mechanism, with a smaller batch size.
> - Eliminates the use of a spin lock during batch updates, bounding stat
> update latency.
> - Reduces percpu memory usage and thus thread startup time.
>
> Most importantly, this bounds the total error to 32 times the number of NUMA
> nodes, significantly smaller than previous error bounds.
>
> On a 112-core machine, lmbench showed comparable results before and after this
> patch. However, on a 224-core machine, performance improvements were
> significant over percpu_counter:
> - Pagefault latency improved by 8.91%
> - Process fork latency improved by 6.27%
> - Process fork/execve latency improved by 6.06%
> - Process fork/exit latency improved by 6.58%
>
> will-it-scale also showed significant improvements on these machines.
>

The problem with fork/exec/exit stems from back-to-back trips to the
per-cpu allocator every time an mm is allocated/freed (which happens for
each of these syscalls) -- they end up serializing on the same global
spinlock.

On the alloc side this is mm_alloc_cid() followed by percpu_counter_init_many().
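
Roughly, paraphrasing mm_init() and __mmdrop() in kernel/fork.c from
memory (exact signatures and gfp flags may have drifted between versions):

        /* alloc side, in mm_init(): two back-to-back percpu allocations */
        if (mm_alloc_cid(mm))
                goto fail_cid;
        if (percpu_counter_init_many(&mm->rss_stat[0], 0, GFP_KERNEL_ACCOUNT,
                                     NR_MM_COUNTERS))
                goto fail_pcpu;

        /* free side, in __mmdrop(): the same dance in reverse */
        percpu_counter_destroy_many(&mm->rss_stat[0], NR_MM_COUNTERS);
        mm_destroy_cid(mm);

All of these funnel into the percpu allocator and contend on its global
lock.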

Even if you eliminate the counters for rss, you are still paying for CID. While
this scales better than the stock kernel, it still leaves perf on the table.

Per our discussion on IRC there is WIP to eliminate both cases by
caching the state in the mm. This depends on adding a dtor for SLUB to
undo the work done in the ctor. Harry did the work on that front, though
it has not been submitted to -next yet.
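
To illustrate the idea -- this is a hypothetical sketch with made-up
names (mm_cachep_ctor/mm_cachep_dtor), not the actual WIP; the dtor hook
does not exist upstream yet and error handling is elided:

static void mm_cachep_ctor(void *obj)
{
        struct mm_struct *mm = obj;

        /*
         * Done once when the slab object is first constructed, then reused
         * across many fork()/exit() cycles without touching the percpu
         * allocator again.  The CID buffer would get the same treatment.
         */
        percpu_counter_init_many(&mm->rss_stat[0], 0, GFP_KERNEL,
                                 NR_MM_COUNTERS);
}

/* relies on the SLUB dtor support mentioned above */
static void mm_cachep_dtor(void *obj)
{
        struct mm_struct *mm = obj;

        percpu_counter_destroy_many(&mm->rss_stat[0], NR_MM_COUNTERS);
}

The catch is that a freed mm has to leave the counters at zero so the next
user can take them as-is, which is exactly what check_mm() already verifies.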

There is also a highly inefficient sanity-check loop in check_mm(). Instead
of walking all the per-CPU counters 4 times, toggling interrupts in between,
it can do the walk once.
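
Something like this (completely untested, name made up, and it peeks at
the percpu_counter internals directly -- which should be fine here since
the mm is being torn down and nobody else can update the counters):

static void check_mm_rss_onepass(struct mm_struct *mm)
{
        s64 sum[NR_MM_COUNTERS];
        int i, cpu;

        /* start with the already-folded part of each counter */
        for (i = 0; i < NR_MM_COUNTERS; i++)
                sum[i] = percpu_counter_read(&mm->rss_stat[i]);

        /* one walk over the CPUs, accumulating all counters at once */
        for_each_possible_cpu(cpu)
                for (i = 0; i < NR_MM_COUNTERS; i++)
                        sum[i] += *per_cpu_ptr(mm->rss_stat[i].counters, cpu);

        for (i = 0; i < NR_MM_COUNTERS; i++)
                if (unlikely(sum[i]))
                        pr_alert("BUG: Bad rss-counter state mm:%p idx:%d val:%lld\n",
                                 mm, i, sum[i]);
}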

So that's it for the fork/execve/exit triplet.

As for the page fault latency, your patch adds atomics to the fast path.
Even absent any competition for cache lines with other CPUs, this will be
slower to execute than the current primitive. I suspect you are
observing a speedup with your change because you end up landing in the
slowpath a lot, and that sucker is globally serialized on a spinlock --
this has to hurt.

Per my other message in the thread, and a later IRC discussion, this is
fixable by adding intermediate counters, in the same spirit as what you did here.
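
Something along these lines -- a sketch only, with made-up names, no
cacheline padding and hand-waved batch handling, just to show the shape
(percpu deltas in the fast path, per-node atomics as the intermediate
tier, no global spinlock anywhere):

struct numa_batched_counter {
        s32 __percpu    *cpu_delta;                     /* fast path */
        atomic64_t      node_count[MAX_NUMNODES];       /* slow path, per node */
};

static void nbc_add(struct numa_batched_counter *c, s32 v, s32 batch)
{
        s32 *delta;

        preempt_disable();
        delta = this_cpu_ptr(c->cpu_delta);
        *delta += v;
        if (abs(*delta) >= batch) {
                /* only contended by CPUs on the same node */
                atomic64_add(*delta, &c->node_count[numa_node_id()]);
                *delta = 0;
        }
        preempt_enable();
}

static s64 nbc_read(struct numa_batched_counter *c)
{
        s64 sum = 0;
        int node;

        /*
         * Approximate: ignores the not-yet-folded percpu deltas, much
         * like percpu_counter_read() does today.
         */
        for_each_node(node)
                sum += atomic64_read(&c->node_count[node]);
        return sum;
}

A read then sums a handful of per-node atomics instead of walking every
CPU, and the folding step no longer serializes globally.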

I'll note though that NUMA nodes can contain an awful lot of cores, so
that granularity may still be way too coarse.

That aside, there are globally-locked lists that mms cycle in and out of,
which can also get the "stay there while cached" treatment.

All in all I claim that:
1. fork/execve/exit tests will do better than they do with your patch if
trips to the percpu allocator get eliminated altogether (your patch still
takes them for mm_alloc_cid() and the freeing counterpart), along with
unscrewing the loop in check_mm().
2. fault handling will be faster than it is with your patch *if*
something like per-NUMA state gets added for the slowpath -- the stock
fast path is faster than yours, but the stock slowpath is way slower. You
can get the best of both worlds on this one.

Hell, it may be that your patch as-is can be easily repurposed to
decentralize the main percpu counter? I mean, perhaps there is no need
for any fancy hierarchical structure.

I can commit to providing a viable patch for sorting out the
fork/execve/exit side, but it is going to take about a week. I do have
a PoC in the meantime (too ugly to share publicly :>).

So that's my take on it. Note I'm not a maintainer of any of this, but I
did some work on the thing in the past.