[PATCH 0/7 v3] mm/memcontrol, page_counter: move stock from mem_cgroup to page_counter

From: Joshua Hahn

Date: Mon May 25 2026 - 15:05:07 EST

Memcg currently keeps a "stock" of 64 pages per-cpu to cache pre-charged
allocations, allowing small allocations to avoid walking the expensive
mem_cgroup hierarchy traversal and atomic operations on each charge.
This design introduces a fastpath, but there is room for improvement:

1. Currently, each CPU tracks up to 7 (NR_MEMCG_STOCK) mem_cgroups. When
more than 7 mem_cgroups are actively charging on a single CPU, a
random victim is evicted and its associated stock is drained.

2. Stock management is tightly coupled to struct mem_cgroup, which makes
it difficult to add a new page_counter to mem_cgroup and have
multiple sources of stock management, which is required when trying
to introduce fastpaths to multiple hard limit checks.

This series moves the per-cpu stock down into the page_counter which
consolidates stock limit checking and page_counter limit checking into
page_counter_try_charge. This eliminates the 7-memcg-per-cpu slot limit,
the random evictions (drain & refill), and slot traversal.

In turn, we can add independent stock management for additional
page_counters in each memcg, which is used in my tiered memory limits
series to add a new page_counter to track toptier usage [1].

The resulting code in memcg is also easier to follow, as the caching
becomes transparent from memcg's perspective and managed entirely within
page_counter.

There are, however, a few tradeoffs.

First, the bound on how much memory can be overcharged (and remain stale
as stock) is raised. Previously, it was fixed to nr_cpus x 7 x 64 pages.
Now, it becomes nr_leaf_cgroups x nr_cpus x 64 pages. On large machines
with many cgroups, this could be significant. There are three qualifying
points: (1) larger machines should be able to tolerate the additional
overhead, (2) the stock should not remain stale as long as the
cgroups are actively charging memory, and (3) a process would have to
migrate across all CPUs to incur this upper bound on overhead.

Secondly, we introduce some additional memory footprint. The new struct
page_counter_stock adds 2 words of extra overhead per-(cpu x memcg).

A small change is that for cgroupv1, reported memsw usage can be lower
than reported memory usage, if the memsw page_counter overcharges to its
stock whereas the memory page_counter does not.

Finally, to keep the above memory footprint limited, I opted to not
embed a work_struct into page_counter_stock, but rather decided to
trigger synchronous stock draining, since the drain operation is rarer
now, and only happens under memory pressure and on cgroup death.

Performance testing across single-cgroup, as well as 4-cgroup (under the
7 memcg limit) and 32-cgroup scenarios on a 40CPU, 50G memory system
shows negligible performance differences. In the tests, I repeatedly
fault and release anonymous pages using madvise(MADV_DONTNEED) to
stress the charge/uncharge path, across 40 trials of 50 iterations.
Metric here is time it took across each iteration (ms).

There are two testing versions below; the only difference is that v3
is based on top of mm-new, and v2 is based on top of mm-stable. The
"after" on both sides are similar, but mm-new and mm-stable have
different perforamnces.

v3, tested against mm-new
+----------+--------+-------+-----------+
| #cgroups | mm-new | after | delta (%) |
+----------+--------+-------+-----------+
| 1 | 357 | 358 | +0.283 |
| 4 | 1245 | 1214 | -2.430 |
| 32 | 9281 | 8970 | -3.470 |
+----------+--------+-------+-----------+

v2, tested against mm-stable
+----------+-----------+-------+-----------+
| #cgroups | mm-stable | after | delta (%) |
+----------+-----------+-------+-----------+
| 1 | 352 | 353 | +0.283 |
| 4 | 1198 | 1217 | +1.585 |
| 32 | 8980 | 9027 | +0.526 |
+----------+-----------+-------+-----------+

Further testing on other stress-ng microbenchmarks also agreed with
these results.

v2 --> v3:
- Rebased on top of latest mm-new, May 25, 2026, since the previous
version could not be applied for Sashiko review.
- Re-ran test numbers

v1 --> v2:
- Dropped stock returning on uncharge to preserve same behavior as memcg
stock. This resolves some race conditions present in v1.
- Fixed many race conditions between disabling page_counter_stock and
in-flight charges
- Restructured drain_all_stock to iterate over all CPUs first before
memcgs, to reduce the number of synchronous CPU work scheduling
- Optimized cgroup v2 further to drain only on the first child and skip
the root mem_cgroup
- Dropped RFC
- Wordsmithing cover letter

[1] https://lore.kernel.org/all/20260423203445.2914963-1-joshua.hahnjy@xxxxxxxxx/

Joshua Hahn (7):
mm/page_counter: introduce per-page_counter stock
mm/page_counter: use page_counter_stock in page_counter_try_charge
mm/page_counter: introduce stock drain APIs
mm/memcontrol: convert memcg to use page_counter_stock
mm/memcontrol: optimize memsw stock for cgroup v1
mm/memcontrol: optimize stock usage for cgroup v2
mm/memcontrol: remove unused memcg_stock code

include/linux/page_counter.h | 16 ++
mm/memcontrol.c | 289 +++++++----------------------------
mm/page_counter.c | 140 ++++++++++++++++-
3 files changed, 212 insertions(+), 233 deletions(-)

--
2.53.0-Meta