Re: [PATCH v4 0/5] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

From: Joshua Hahn

Date: Fri Jun 26 2026 - 16:19:09 EST


On Tue, 23 Jun 2026 11:01:18 -0700 Joshua Hahn <joshua.hahnjy@xxxxxxxxx> wrote:

> This series is intended for the next release cycle.
>
> v3 --> v4
> =========
> - Reduced memory footprint by 4x, from 16 bytes per-(cpu x memcg) to
> 4 bytes per-(cpu x memcg). Each page_counter_stock is a thin wrapper
> around an atomic_t.
> - Removed locking completely and uses atomic operations to use stock.
> - Removed synchronous work_on_cpu. All work is done via remote
> atomic_xchgs.
> - Added a patch to flatten page_counter charging in try_charge_memcg
> - Split page_counter_try_charge into stocked and non-stocked variants.
>
> INTRO
> =====
> Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
> allocations, allowing small and frequent allocations to avoid walking
> the expensive mem_cgroup hierarchy traversal each time. This fastpath
> offers real improvements, but there is room for improvement:
>
> 1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
> more than 7 mem_cgroups have stock present on a single CPU, a random
> victim is evicted and its associated stock is drained.
>
> 2. When one cgroup runs out of memory and needs to drain stock across
> all CPUs it has stock cached in, those CPUs will drain all other
> memcgs' stock present in that CPU. This leads to inefficient stock
> caching and cross-memcg interference under memory pressure.
>
> 3. Stock management is tightly coupled to struct mem_cgroup, which makes
> it difficult to add a new page_counter to mem_cgroup and have
> multiple sources of stock management.
>
> This series moves the per-cpu stock down into page_counter which
> consolidates stock limit checking and page_counter limit checking into
> page_counter_try_charge_stock. This eliminates the 7 memcg-per-cpu slot
> limit, the random cross-memcg stock drains, and slot traversal. We also
> simplify memcontrol code, since we no longer need to maintain separate
> draining functions or manage the asynchronous workqueue.

Hello,

I just want to address a few things that Sashiko raised. I think there
are definitely some improvements that I can make as Sashiko suggested.

In commit 3/5, "mm/page_counter: introduce page_counter_try_charge_stock()",
Sashiko raises two concerns.

"Can the per-CPU stock grow unboundedly beyond counter->batch pages here?"

I think this is true. I went back to the original stock design and saw
that when the stock is greater than the batch size, it just drains all
of it (since this means we raced). I can add the same check so that we
never grow beyond the batch size. This should also help with the point
below.
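
Roughly, the check could look like the sketch below. This is a userspace
C11 model and not the patch itself: struct stock_model and
stock_refill_capped() are made-up names, and the real code would use the
kernel's atomic_t helpers instead of <stdatomic.h>.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model; not the kernel's page_counter_stock. */
struct stock_model {
	atomic_int nr_pages;	/* pages cached for one (cpu x counter) pair */
};

/*
 * Add nr_pages to the stock. If the result exceeds batch, take everything
 * back out and return the number of pages the caller has to uncharge.
 * Returns 0 when the stock stayed within the batch.
 */
static int stock_refill_capped(struct stock_model *stock, int nr_pages,
			       int batch)
{
	int total;

	assert(nr_pages >= 0 && batch >= 0);

	total = atomic_fetch_add(&stock->nr_pages, nr_pages) + nr_pages;
	if (total <= batch)
		return 0;

	/*
	 * Over the cap, most likely because we raced with another refill.
	 * Drain whatever is present right now, which may differ from total.
	 */
	return atomic_exchange(&stock->nr_pages, 0);
}
```

With batch = 64, refilling an empty stock with 32 pages keeps them cached,
and a following refill of 40 pages hands all 72 pages back to the caller.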

"Does moving the per-CPU cache from a single shared stock to a per-page_counter
stock fundamentally change the memory stranding bounds?"

This is true, and I addressed this in the cover letter. Yes, the worst-case
upper bound grows by quite a bit, but it is difficult to hit that limit,
since it would require a memcg's processes to be scheduled on all the CPUs
and to strand memory in the stock on each of them. Nonetheless, capping the
stock at the batch size should make this worst case a bit better.
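
As a back-of-the-envelope illustration, assuming every (cpu x page_counter)
stock is capped at batch pages, the stranded memory is bounded by the
product of the three. The helper name and the numbers below are
hypothetical and only meant to show the arithmetic:

```c
#include <assert.h>
#include <limits.h>

/*
 * Worst-case number of pages sitting in stock, ASSUMING each
 * (cpu x counter) stock never holds more than batch pages.
 * Hypothetical helper, for illustration only.
 */
static unsigned long long stranded_pages_bound(unsigned int nr_cpus,
					       unsigned int nr_counters,
					       unsigned int batch)
{
	unsigned long long per_counter = (unsigned long long)nr_cpus * batch;

	/* Guard the illustration against wraparound on absurd inputs. */
	assert(nr_counters == 0 || per_counter <= ULLONG_MAX / nr_counters);

	return per_counter * nr_counters;
}
```

For example, 64 CPUs, a single counter and a batch of 64 pages gives 4096
pages, which is 16 MiB with 4 KiB pages. Hitting that bound still requires
the stock to be full on every CPU at the same time.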

In commit 4/5, "mm/memcontrol: convert memcg to use page_counter_stock()",
Sashiko also raises two concerns.

"Could this synchronous loop cause cacheline bouncing and premature OOM kills?"

Sashiko is referring to the (memcg x cpu) iteration in which we drain a
descendant's entire stock. I also addressed this in the cover letter, where
I noted that I couldn't really reproduce the issue in my testing. I have
addressed this in every version, but it seems like Sashiko does not read
the cover letter :'(

"Would doing a volatile read to check if
the stock has pages before calling atomic_xchg() help mitigate this?"

I agree with this one. I'll add:

	if (!atomic_read(&stock->nr_pages))
		return;
	nr_pages = atomic_xchg(&stock->nr_pages, 0);

Hopefully this lets us avoid most of the unnecessary races. Of course, the
value can still change between the read and the atomic_xchg(), but this is
just a best-effort optimization.
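
To show the effect across a drain loop, here is a userspace C11 model of
draining one counter's stock on every CPU. It is only a sketch:
struct stock_model, stock_drain_all() and the nr_xchg out-parameter are
hypothetical names, and an array stands in for the real per-cpu data.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model of one per-CPU stock slot. */
struct stock_model {
	atomic_int nr_pages;
};

/*
 * Drain every slot in stocks[0..nr_cpus). Returns the total number of
 * pages recovered and stores the number of atomic exchanges actually
 * issued in *nr_xchg.
 */
static int stock_drain_all(struct stock_model *stocks, int nr_cpus,
			   int *nr_xchg)
{
	int total = 0;

	assert(nr_cpus >= 0);
	*nr_xchg = 0;

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		/* Plain read first: an empty slot is skipped without a write. */
		if (!atomic_load_explicit(&stocks[cpu].nr_pages,
					  memory_order_relaxed))
			continue;

		/*
		 * The value may have changed since the read. The exchange
		 * returns exactly what was present, which may now be 0.
		 */
		total += atomic_exchange(&stocks[cpu].nr_pages, 0);
		(*nr_xchg)++;
	}

	return total;
}
```

With four slots holding 0, 12, 0 and 5 pages, this recovers 17 pages using
two exchanges instead of four, so the empty slots are only ever read.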

So I'll spin up a v5. One thing I'm going back and forth on is whether we
want separate stocked and non-stocked variants, or whether that should
just happen transparently within the calls.

Thanks again Sashiko!
Joshua