[PATCH 00/11] mm/zswap, zsmalloc: Per-memcg-lruvec zswap accounting
From: Joshua Hahn
Date: Wed Mar 11 2026 - 15:52:16 EST
INTRODUCTION
============
The current design for zswap and zsmalloc leaves a clean divide between
layers of the memory stack. At the higher level, we have zswap, which
interacts directly with memory consumers and compression algorithms,
and handles memory usage accounting via memcg limits. At the lower
level, we have zsmalloc, which handles the allocation and migration of
physical pages.
While this logical separation simplifies the codebase, it creates
problems for accounting that requires both memory cgroup awareness and
knowledge of physical memory location. To name a few:
- On tiered systems, it is impossible to understand how much toptier
memory a cgroup is using, since zswap has no understanding of where
the compressed memory is physically stored.
+ With SeongJae Park's work to store incompressible pages as-is in
zswap [1], the size of compressed memory can become non-trivial,
and easily consume a meaningful portion of memory.
- cgroups that restrict memory nodes have no control over which nodes
their zswapped objects live on. This can lead to unexpectedly high
fault times for workloads, which must pay the remote access latency
cost of retrieving the compressed object from a remote node.
+ Nhat Pham addressed this issue via a best-effort attempt to place
compressed objects in the same page as the original page, but this
cannot guarantee complete isolation [2].
- On the flip side, zsmalloc's ignorance of cgroups also makes its
shrinker memcg-unaware, which can lead to ineffective reclaim when
pressure is localized to a single cgroup.
Until recently, zpool acted as another layer of indirection between
zswap and zsmalloc, which made bridging memcg and physical location
difficult. Now that zsmalloc is the only allocator backend for zswap and
zram [3], it is possible to move memory-cgroup accounting to the
zsmalloc layer.
Introduce a new per-zspage array of objcg pointers to track
per-memcg-lruvec memory usage by zswap, while leaving zram users
mostly unaffected.
In addition, move the accounting of memcg charges from the consumer
layer (zswap, zram) to the zsmalloc layer. Stat indices are
parameterized at pool creation time, meaning future consumers that wish
to account memory statistics can do so using the compressed object
memory accounting infrastructure introduced here.
PERFORMANCE
===========
The experiments were performed across 5 trials on a 2-NUMA machine.
Experiment 1:
Node-bound workload, churning memory by allocating 2GB in 1GB cgroup.
0.638% regression, standard deviation: +/- 0.603%
Experiment 2:
Writeback with zswap pressure
0.295% gain, standard deviation: +/- 0.456%
Experiment 3:
1 cgroup, 2 workloads each bound to a NUMA node.
2.126% regression, standard deviation: +/- 3.008%
Experiment 4:
Reading memory.stat 10000x
1.464% gain, standard deviation: +/- 2.239%
Experiment 5:
Reading memory.numa_stat 10000x
0.281% gain, standard deviation: +/- 1.878%
All of the gains and regressions fall within the standard deviation.
I would like to note that workloads that span NUMA nodes may see some
contention, as the zsmalloc migration path becomes more expensive.
PATCH OUTLINE
=============
Patches 1 and 2 are small cleanups that make the codebase consistent and
easier to digest.
Patch 3 introduces memcg accounting-awareness to struct zs_pool, and
allows consumers to provide the memcg stat item indices that should be
accounted. The awareness is not functional at this point.
Patches 4, 5, and 6 allocate and populate the new zspage->objcgs field
with the compressed objects' obj_cgroups. zswap_entry->objcg is removed,
and lookups are redirected to the zspage for memcg information.
Patch 7 moves the charging and lifetime management of obj_cgroups to the
zsmalloc layer, which leaves zswap only as a plumbing layer to hand
cgroup information to zsmalloc at compression time.
Patches 8 and 9 introduce node counters and memcg-lruvec counters for
zswap.
Patches 10 and 11 handle charge migrations for the two types of compressed
object migration in zsmalloc. Special care is taken for compressed
objects that span multiple nodes.
CHANGELOG V1 [4] --> V2
=======================
A lot has changed from v1 to v2, thanks to the generous suggestions
from reviewers.
- Harry Yoo's suggestion to make the objcgs array per-zspage instead of
per-zpdesc simplified much of the code needed to handle boundary
cases: the index translation from per-zspage object indices to
per-zpdesc slots is no longer needed. Note that this does make the
reverse direction (per-zpdesc to per-zspage) harder, but the only case
where this really matters is the charge migration in patch 10. Thank
you Harry!
- Yosry Ahmed's suggestion to make memcg awareness a per-zs_pool
decision has simplified much of the #ifdef casing, which makes the code
a lot easier to follow (and makes the changes less invasive for zram).
- Yosry Ahmed's suggestion to parameterize the memcg stat indices as a
zs_pool parameter makes the awkward hardcoding of zswap stat indices
in zsmalloc code more natural, and leaves room for future consumers to
follow. Thank you Yosry!
- Shakeel Butt's suggestion to turn the objcgs array from an unsigned
long into a struct obj_cgroup ** pointer made the code much cleaner.
However, after moving the pointer from the zpdesc to the zspage, there
is no longer a need to tag the pointer. Thank you, Shakeel!
- v1 only handled the migration case for single compressed objects.
Patch 10 in v2 is written to handle the migration case for zpdesc
replacement.
+ Special-casing compressed objects living at a page boundary is a
tad harder with per-zspage objcgs. I felt that this difficulty was
outweighed by the simplification in the "typical" write/free case,
though.
REVIEWERS NOTE
==============
Patches 10 and 11 are a bit hairy, since they have to deal with
special-case scenarios for objects that span pages. I originally
implemented a very simple approach that uses the existing
zs_charge_objcg functions, but later realized that these migration
paths hold spin locks and therefore cannot tolerate obj_cgroup_charge
going to sleep.
The workaround is less elegant, but gets the job done. Feedback on these
two commits would be greatly appreciated!
[1] https://lore.kernel.org/linux-mm/20250822190817.49287-1-sj@xxxxxxxxxx/
[2] https://lore.kernel.org/linux-mm/20250402204416.3435994-1-nphamcs@xxxxxxxxx/#t3
[3] https://lore.kernel.org/linux-mm/20250829162212.208258-1-hannes@xxxxxxxxxxx/
[4] https://lore.kernel.org/all/20260226192936.3190275-1-joshua.hahnjy@xxxxxxxxx/
Joshua Hahn (11):
mm/zsmalloc: Rename zs_object_copy to zs_obj_copy
mm/zsmalloc: Make all obj_idx unsigned ints
mm/zsmalloc: Introduce conditional memcg awareness to zs_pool
mm/zsmalloc: Introduce objcgs pointer in struct zspage
mm/zsmalloc: Store obj_cgroup pointer in zspage
mm/zsmalloc, zswap: Redirect zswap_entry->objcg to zspage
mm/zsmalloc, zswap: Handle objcg charging and lifetime in zsmalloc
mm/memcontrol: Track MEMCG_ZSWAPPED in bytes
mm/vmstat, memcontrol: Track ZSWAP_B, ZSWAPPED_B per-memcg-lruvec
mm/zsmalloc: Handle single object charge migration in migrate_zspage
mm/zsmalloc: Handle charge migration in zpdesc substitution
drivers/block/zram/zram_drv.c | 10 +-
include/linux/memcontrol.h | 20 +-
include/linux/mmzone.h | 2 +
include/linux/zsmalloc.h | 9 +-
mm/memcontrol.c | 75 ++-----
mm/vmstat.c | 2 +
mm/zsmalloc.c | 381 ++++++++++++++++++++++++++++++++--
mm/zswap.c | 66 +++---
8 files changed, 431 insertions(+), 134 deletions(-)
--
2.52.0