[PATCH v3 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata

From: Kairui Song via B4 Relay

Date: Tue Apr 21 2026 - 02:23:26 EST


This series unifies the allocation and charging of anon and shmem
swap-in folios, provides better synchronization, and consolidates the
metadata management, which allows dropping the static array and map.
The static metadata overhead is now close to zero, and workload
performance is slightly improved.

For example, mounting a 1TB swap device saves about 512MB of memory:

Before:
free -m
               total        used        free      shared  buff/cache   available
Mem:            1464         805         346           1         382         658
Swap:        1048575           0     1048575

After:
free -m
               total        used        free      shared  buff/cache   available
Mem:            1464         277         899           1         356        1187
Swap:        1048575           0     1048575

Memory usage is ~512M lower, and the static overhead is now close to
zero. It was about 2 bytes per slot before and is now roughly 0.09375
bytes per slot (48 bytes of ci info per cluster, and a cluster is 512
slots).
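
As a rough cross-check of the numbers above, the arithmetic can be
sketched as below. This is only a back-of-the-envelope estimate, not
code from the series: it assumes one slot per 4 KiB page (the page size
is not stated in this letter), and the helper name static_overhead is
made up for illustration.

```python
# Back-of-the-envelope check of the static metadata numbers quoted above.
# Assumption (not stated in the cover letter): one swap slot covers one
# 4 KiB page.

PAGE_SIZE = 4096            # bytes per slot, assumed
SLOTS_PER_CLUSTER = 512     # from the cover letter
CI_BYTES_PER_CLUSTER = 48   # from the cover letter
OLD_BYTES_PER_SLOT = 2      # "about 2 bytes per slot" before the series

MIB = 1 << 20
TIB = 1 << 40


def static_overhead(swap_bytes):
    """Return (old, new) static metadata size in bytes for a swap device."""
    slots = swap_bytes // PAGE_SIZE
    clusters = slots // SLOTS_PER_CLUSTER
    return slots * OLD_BYTES_PER_SLOT, clusters * CI_BYTES_PER_CLUSTER


old, new = static_overhead(TIB)
new_per_slot = CI_BYTES_PER_CLUSTER / SLOTS_PER_CLUSTER

print(f"old: {old // MIB} MiB, new: {new // MIB} MiB")
print(f"new overhead per slot: {new_per_slot} bytes")
```

Under these assumptions a 1TB device needed about 512 MiB of per-slot
metadata before, and the remaining per-cluster info is a small fraction
of that.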

Performance tests also look good. Testing Redis in a 1.5G VM using
5G ZRAM as swap:

valkey-server --maxmemory 2560M
redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get

Before: 3289011.918750 RPS
After: 3312087.142241 RPS (0.70% better)

Testing a kernel build under global memory pressure on a 48c96t system,
limiting the total memory to 8G, using 12G ZRAM, with THP enabled,
24 test runs:

make -j96, using defconfig

Before: user time 2904.59s system time 4773.99s
After: user time 2909.38s system time 4641.55s (2.77% better)

Testing with usemem on a 32c machine using a 48G brd ramdisk and 16G
RAM, 12 test runs:

usemem --init-time -O -y -x -n 48 1G

Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us
After: Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us

Seems similar, or slightly better.
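
For reference, the relative changes for the kernel build and usemem
figures above can be recomputed with a small sketch like the one below.
The numbers are copied from this letter; the helper name pct_change is
only for illustration.

```python
# Relative changes for the figures quoted above (numbers copied from the
# cover letter; the helper name is just for illustration).

def pct_change(before, after):
    """Percentage change from before to after; negative means lower."""
    return (after - before) / before * 100.0


build_sys = pct_change(4773.99, 4641.55)        # kernel build, system time (s)
usemem_tput = pct_change(6482.58, 6539.28)      # usemem throughput (MB/s)
usemem_free = pct_change(371371.67, 363059.88)  # usemem free latency (us)

print(f"build system time:   {build_sys:+.2f}%")
print(f"usemem throughput:   {usemem_tput:+.2f}%")
print(f"usemem free latency: {usemem_free:+.2f}%")
```

That works out to roughly 2.8% less system time for the build, a bit
under 1% more usemem throughput, and a bit over 2% lower free latency.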

This series also reduces memory thrashing. I no longer see any
"Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF" messages.
Before this series, the message showed up several times during stress
testing under heavy memory pressure:

Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18
After: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0

Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
---
Changes in v3:
- This is based on mm-unstable and also applies to mm-new. It has no
conflict with YoungJun's tier series, and only a trivial conflict with
Baoquan's swapops due to a filename change.
- Fix zero map build issue on 32 bit archs [ YoungJun Park ]
- Cleanup memcg table allocation helpers [ YoungJun Park ]
- Fix WARN for non NUMA build:
https://lore.kernel.org/linux-mm/CAMgjq7ANih7u7SJB8uWcQHS8XRJySNRc3ti9V-SVey0nGE3gLQ@xxxxxxxxxxxxxx/
- Improve commit messages.
- Re-ran several tests; the conclusions are the same as in v2.
- Link to v2: https://patch.msgid.link/20260417-swap-table-p4-v2-0-17f5d1015428@xxxxxxxxxxx

Changes in v2:
- Drop the RFC prefix and also the RFC part.
- There is now zero change to cgroup or refault tracking; RFC v1 changed
some cgroup behavior. To achieve that, v2 uses a standalone memcg_table
for each cluster. It can be dropped or better optimized later if we
have a better solution. The performance gain is partly cancelled
compared to RFC v1 since we now need an extra allocation for free cluster
isolation and peak memory usage is 2 bytes higher. But it still looks
good. The table size is acceptable (1024 bytes), no RCU is needed, and
it fits kmalloc. Even if we keep it as it is in the future,
it's still acceptable.
- Link to v1: https://lore.kernel.org/r/20260220-swap-table-p4-v1-0-104795d19815@xxxxxxxxxxx

To: linux-mm@xxxxxxxxx
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Chris Li <chrisl@xxxxxxxxxx>
Cc: Kairui Song <kasong@xxxxxxxxxxx>
Cc: Kemeng Shi <shikemeng@xxxxxxxxxxxxxxx>
Cc: Nhat Pham <nphamcs@xxxxxxxxx>
Cc: Baoquan He <bhe@xxxxxxxxxx>
Cc: Barry Song <baohua@xxxxxxxxxx>
Cc: Youngjun Park <youngjun.park@xxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Yosry Ahmed <yosry@xxxxxxxxxx>
Cc: Chengming Zhou <chengming.zhou@xxxxxxxxx>
Cc: David Hildenbrand <david@xxxxxxxxxx>
Cc: Lorenzo Stoakes <ljs@xxxxxxxxxx>
Cc: Zi Yan <ziy@xxxxxxxxxx>
Cc: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Cc: Dev Jain <dev.jain@xxxxxxx>
Cc: Lance Yang <lance.yang@xxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxx>
Cc: Michal Hocko <mhocko@xxxxxxxxxx>
Cc: Roman Gushchin <roman.gushchin@xxxxxxxxx>
Cc: Shakeel Butt <shakeel.butt@xxxxxxxxx>
Cc: Muchun Song <muchun.song@xxxxxxxxx>
Cc: Suren Baghdasaryan <surenb@xxxxxxxxxx>
Cc: Axel Rasmussen <axelrasmussen@xxxxxxxxxx>
Cc: Qi Zheng <zhengqi.arch@xxxxxxxxxxxxx>
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: cgroups@xxxxxxxxxxxxxxx

---
Kairui Song (12):
mm, swap: simplify swap cache allocation helper
mm, swap: move common swap cache operations into standalone helpers
mm/huge_memory: move THP gfp limit helper into header
mm, swap: add support for stable large allocation in swap cache directly
mm, swap: unify large folio allocation
mm/memcg, swap: tidy up cgroup v1 memsw swap helpers
mm, swap: support flexible batch freeing of slots in different memcgs
mm, swap: delay and unify memcg lookup and charging for swapin
mm, swap: consolidate cluster allocation helpers
mm/memcg, swap: store cgroup id in cluster table directly
mm/memcg: remove no longer used swap cgroup array
mm, swap: merge zeromap into swap table

MAINTAINERS | 1 -
include/linux/huge_mm.h | 30 +++
include/linux/memcontrol.h | 16 +-
include/linux/swap.h | 19 +-
include/linux/swap_cgroup.h | 47 ----
mm/Makefile | 3 -
mm/huge_memory.c | 2 +-
mm/internal.h | 11 +-
mm/memcontrol-v1.c | 66 +++---
mm/memcontrol.c | 32 +--
mm/memory.c | 88 ++------
mm/page_io.c | 58 ++++-
mm/shmem.c | 122 +++--------
mm/swap.h | 91 +++-----
mm/swap_cgroup.c | 172 ---------------
mm/swap_state.c | 516 +++++++++++++++++++++++++-------------------
mm/swap_table.h | 169 ++++++++++++---
mm/swapfile.c | 212 +++++++++---------
mm/vmscan.c | 2 +-
mm/zswap.c | 25 +--
20 files changed, 783 insertions(+), 899 deletions(-)
---
base-commit: f1541b40cd422d7e22273be9b7e9edfc9ea4f0d7
change-id: 20260111-swap-table-p4-98ee92baa7c4

Best regards,
--
Kairui Song <kasong@xxxxxxxxxxx>