Re: [RFC PATCH v2 2/2] mm: convert mm's rss stats to use atomic mode

From: Mateusz Guzik
Date: Fri May 17 2024 - 14:09:15 EST


On Fri, May 17, 2024 at 11:29:57AM +0800, Kairui Song wrote:
> Mateusz Guzik <mjguzik@xxxxxxxxx> 于 2024年5月16日周四 23:14写道:
> > A part of The Real Solution(tm) would make counter allocations scale
> > (including mcid, not just rss) or dodge them (while maintaining the
> > per-cpu distribution, see below for one idea), but that boils down to
> > balancing scalability versus total memory usage. It is trivial to just
> > slap together a per-cpu cache of these allocations and have the problem
> > go away for benchmarking purposes, while being probably being too memory
> > hungry for actual usage.
> >
> > I was pondering an allocator with caches per some number of cores (say 4
> > or 8). Microbenchmarks aside I suspect real workloads would not suffer
> > from contention at this kind of granularity. This would trivially reduce
> > memory usage compared to per-cpu caching. I suspect things like
> > mm_struct, task_struct, task stacks and similar would be fine with it.
> >
> > Suppose mm_struct is allocated from a more coarse grained allocator than
> > per-cpu. Total number of cached objects would be lower than it is now.
> > That would also mean these allocated but not currently used mms could
> > hold on to other stuff, for example per-cpu rss and mcid counters. Then
> > should someone fork or exit, alloc/free_percpu would be avoided for most
> > cases. This would scale better and be faster single-threaded than the
> > current state.
>
> And what is the issue with using only one CPU cache, and flush on mm
> switch? No more alloc after boot, and the total (and fixed) memory
> usage is just about a few unsigned long per CPU, which should be even
> lower that the old RSS cache solution (4 unsigned long per task). And
> it scaled very well with many kinds of microbench or workload I've
> tested.
>
> Unless the workload keeps doing something like "alloc one page then
> switch to another mm", I think the performance will be horrible
> already due to cache invalidations and many switch_*()s, RSS isn't
> really a concern there.
>

I only skimmed through your patchset. I do think it has a legitimate
approach, but personally I would not do it like that due to the extra
work on context switches. However, I have 0 say about this, so you will
need to prod the mm overlords to get this moving forward.

Maybe I was not clear enough in my opening e-mail, so I'm going to
reiterate some bits: there are scalability problems in execve even with
your patchset or the one which uses atomics. One of them concerns
another bit which allocates per-cpu memory (the mcid thing).

I note that sorting it out would possibly also take care of the rss
problem, outlining an example approach above.