Re: [PATCH v4 00/12] mm, swap: swap table phase IV: unify allocation and reduce static metadata

From: Kairui Song

Date: Fri May 15 2026 - 09:42:36 EST

On Fri, May 15, 2026 at 6:15 PM Kairui Song via B4 Relay
<devnull+kasong.tencent.com@xxxxxxxxxx> wrote:
>
> From: Kairui Song <kasong@xxxxxxxxxxx>
>
> This series unifies the allocation and charging of anon and shmem swap
> in folios, provides better synchronization, consolidates the metadata
> management, hence dropping the static array and map, and improves the
> performance. The static metadata overhead is now close to zero, and
> workload performance is slightly improved.
>
> For example, mounting a 1TB swap device saves about 512MB of memory:
>
> Before:
> free -m
>          total    used    free  shared  buff/cache  available
> Mem:      1464     805     346       1         382        658
> Swap:  1048575       0 1048575
>
> After:
> free -m
>          total    used    free  shared  buff/cache  available
> Mem:      1464     277     899       1         356       1187
> Swap:  1048575       0 1048575
>
> Memory usage is ~512M lower, and the static overhead is now close to
> 0. It was about 2 bytes per slot before, now roughly 0.09375 bytes
> per slot (48 bytes of ci info per cluster, which is 512 slots).
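
The per-slot figures quoted above can be cross-checked with a few lines of arithmetic. This is only a rough sketch: the 4 KiB page size and the exact 1 TiB device size are assumptions that the cover letter does not state, while the 2 bytes per slot, 48 bytes per cluster and 512 slots per cluster are taken from it.

```python
# Back-of-the-envelope check of the static metadata figures quoted above.
# Assumptions (not stated in the cover letter): 4 KiB pages and a swap
# device of exactly 1 TiB.
PAGE_SIZE = 4096
SWAP_BYTES = 1 << 40

# Figures taken from the cover letter.
OLD_BYTES_PER_SLOT = 2
CI_BYTES_PER_CLUSTER = 48
SLOTS_PER_CLUSTER = 512

MIB = 1 << 20

slots = SWAP_BYTES // PAGE_SIZE            # one slot per swapped page
clusters = slots // SLOTS_PER_CLUSTER

old_overhead = slots * OLD_BYTES_PER_SLOT        # static per-slot metadata
new_overhead = clusters * CI_BYTES_PER_CLUSTER   # per-cluster ci info only
new_bytes_per_slot = CI_BYTES_PER_CLUSTER / SLOTS_PER_CLUSTER

print(f"slots:               {slots}")
print(f"old static overhead: {old_overhead // MIB} MiB")
print(f"new static overhead: {new_overhead // MIB} MiB")
print(f"new bytes per slot:  {new_bytes_per_slot}")
```

Under these assumptions the old layout costs 512 MiB of static metadata and the new one 24 MiB, which matches the 0.09375 bytes per slot figure (48 / 512). The numbers reported by free -m are measurements and need not match this estimate exactly.
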
>
> Performance tests also look good. Testing Redis in a 2G VM using
> 6G ZRAM as swap:
>
> valkey-server --maxmemory 2560M
> redis-benchmark -r 3000000 -n 3000000 -d 1024 -c 12 -P 32 -t get
>
> Before: 3385017.283654 RPS
> After: 3433309.307292 RPS (1.42% better)
>
> Testing a kernel build under global pressure on a 48c96t system,
> limiting the total memory to 8G, using 12G ZRAM, 24 test runs,
> with THP enabled:
>
> make -j96, using defconfig
>
> Before: user time 2904.59s system time 4773.99s
> After: user time 2909.38s system time 4641.55s (2.77% better)
>
> Testing with usemem on a 32c machine using a 48G brd ramdisk and 16G
> RAM, 12 test runs:
>
> usemem --init-time -O -y -x -n 48 1G
>
> Before: Throughput (Sum): 6482.58 MB/s Free Latency: 371371.67us
> After: Throughput (Sum): 6539.28 MB/s Free Latency: 363059.88us
>
> Seems similar, or slightly better.
>
> This series also reduces memory thrashing. I no longer see any
> "Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF"
> messages, which showed up several times during stress testing under
> great pressure before this series:
>
> Before: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 18
> After: grep -Ri VM_FAULT_OOM <test logs> | wc -l => 0
>
> Signed-off-by: Kairui Song <kasong@xxxxxxxxxxx>
> ---
> Changes in v4:
> - Rebased on latest mm-unstable and re-tested; benchmark results are
> basically the same, so they are mostly kept unchanged. Changes in v4
> are code style and very minor behavior changes.
> - Improve a few commit messages, rename a few variables as suggested by
> [ Chris Li ].
> - Rename thp_limit_gfp_mask to thp_shmem_limit_gfp_mask as suggested by
> [ Zi Yan ].
> - Clean up a few allocation and code style issues [ YoungJun Park ]
> - Remove the forced fallback in swap_cache_alloc_folio, the caller will
> pass in the exact orders to be used. [ Baolin Wang ]

One thing I forgot to mention is that this will also provide better
infrastructure on the swap side for Usama's PMD swapin: now
swap_cache_alloc_folio(orders=<PMD ORDER>) will provide a stable PMD
sized folio that is ready to be used as a swap cache folio for doing IO.

> - Rename swapin_entry to swapin_sync; it's only used by synchronous
> devices at this moment, and the new name describes what it does better
> [ David Hildenbrand ]

And the rename here is inspired by Fujunjie's ZSWAP series. This
series should also enable the implementation of more generic THP
(including zswap THP) support.