[RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release

From: Wenchao Hao

Date: Fri May 08 2026 - 02:08:56 EST


Swap freeing can be expensive when unmapping a VMA containing many swap
entries. This has been reported to significantly delay memory reclamation
during Android's low-memory killing, especially when multiple processes
are terminated to free memory, with slot_free() accounting for more than
80% of the total cost of freeing swap entries.

Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
to asynchronously collect and free swap entries [1][2], but the design
itself is fairly complex.

When anon folios and swap entries are mixed within a process, reclaiming
anon folios from killed processes helps return memory to the system as
quickly as possible, so that newly launched applications can satisfy
their memory demands. It is not ideal for swap freeing to block anon
folio freeing. On the other hand, swap freeing can still return memory
to the system, although at a slower rate due to memory compression.

This series introduces a callback-based deferred free framework in
zsmalloc. Callers (zram, zswap) register push/drain callbacks that
define what gets buffered and how it gets drained. The entire free
path, including caller-side bookkeeping (slot_free, zswap_entry_free),
is deferred to a background worker.

Implementation:
- Each CPU owns a single-page buffer. The hot path writes a value
  via the push callback with preemption disabled (no locks).
- When the buffer fills, it is swapped with a fresh page from a
  pre-allocated page pool. The full page is queued to a WQ_UNBOUND
  worker for drain.
- The drain callback performs the actual expensive work (zs_free,
  slot_free, zswap_entry_free, etc.) in batch, off the hot path.
- If no free page is available, the caller falls back to synchronous
  processing.
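
For readers who want a concrete picture of the buffer rotation, here is a
minimal userspace sketch of the steps above. It is not the code from this
series: the df_* names, struct df_ops, and the SKETCH_* constants are all
hypothetical, a single buffer stands in for one CPU's per-cpu page, and
neither preemption control nor the WQ_UNBOUND workqueue is modeled. The
worker is simply a function the caller invokes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_PAGE_SIZE  4096u  /* assumed page size */
#define SKETCH_POOL_PAGES 2      /* assumed number of spare pages */

struct df_page {
	size_t used;                          /* bytes filled so far */
	unsigned char data[SKETCH_PAGE_SIZE];
};

/* Hypothetical callback table; the real ops layout may differ. */
struct df_ops {
	size_t elem_size;
	void (*push)(void *slot, const void *value);
	void (*drain)(const void *elems, size_t count, void *priv);
};

struct df_ctx {
	const struct df_ops *ops;
	void *priv;
	struct df_page *cur;                      /* stands in for one CPU's buffer */
	struct df_page *pool[SKETCH_POOL_PAGES];  /* pre-allocated spare pages */
	size_t pool_free;
	struct df_page *full[SKETCH_POOL_PAGES];  /* pages queued for the worker */
	size_t nr_full;
};

static void df_destroy(struct df_ctx *ctx)
{
	free(ctx->cur);
	while (ctx->pool_free)
		free(ctx->pool[--ctx->pool_free]);
	while (ctx->nr_full)
		free(ctx->full[--ctx->nr_full]);
	ctx->cur = NULL;
}

static int df_init(struct df_ctx *ctx, const struct df_ops *ops, void *priv)
{
	/* Elements must tile the page exactly so "full" is an exact match. */
	assert(ops->elem_size > 0 && SKETCH_PAGE_SIZE % ops->elem_size == 0);
	memset(ctx, 0, sizeof(*ctx));
	ctx->ops = ops;
	ctx->priv = priv;
	ctx->cur = calloc(1, sizeof(*ctx->cur));
	while (ctx->cur && ctx->pool_free < SKETCH_POOL_PAGES) {
		struct df_page *p = calloc(1, sizeof(*p));
		if (!p)
			break;
		ctx->pool[ctx->pool_free++] = p;
	}
	if (!ctx->cur || ctx->pool_free < SKETCH_POOL_PAGES) {
		df_destroy(ctx);
		return -1;
	}
	return 0;
}

/* Returns true if buffered, false if the caller must free synchronously. */
static bool df_push(struct df_ctx *ctx, const void *value)
{
	if (ctx->cur->used == SKETCH_PAGE_SIZE) {
		if (!ctx->pool_free)
			return false;                     /* no spare page: fall back */
		ctx->full[ctx->nr_full++] = ctx->cur; /* queue full page for drain */
		ctx->cur = ctx->pool[--ctx->pool_free];
	}
	ctx->ops->push(ctx->cur->data + ctx->cur->used, value);
	ctx->cur->used += ctx->ops->elem_size;
	return true;
}

/* Body of the background worker: drain queued pages, then recycle them. */
static void df_drain_worker(struct df_ctx *ctx)
{
	while (ctx->nr_full) {
		struct df_page *p = ctx->full[--ctx->nr_full];
		ctx->ops->drain(p->data, p->used / ctx->ops->elem_size, ctx->priv);
		p->used = 0;
		ctx->pool[ctx->pool_free++] = p;
	}
}

/* zram-like example callbacks: elements are u32 slot indices. */
static void u32_push(void *slot, const void *value)
{
	memcpy(slot, value, sizeof(uint32_t));
}

static void u32_drain(const void *elems, size_t count, void *priv)
{
	(void)elems;                 /* a real drain would free each slot */
	*(size_t *)priv += count;    /* here we only count drained entries */
}
```

With 4-byte elements and the assumed 4096-byte page, one buffer holds 1024
entries. With one current page plus two spare pages, the sketch buffers
3072 pushes before df_push() starts returning false, at which point the
caller would free synchronously until df_drain_worker() has recycled the
queued pages.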

The speedup comes from moving expensive swap slot freeing off the
munmap hot path into a background worker, so that intact anonymous
folios are released back to the system without blocking. The worker
drains at a slower rate since compressed objects are small and freeing
a single handle may not release an entire page until the zspage is
fully empty.

Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):

Test 1: munmap latency for 256MB swap-filled VMA (zram backend)

  mode       Base       Patched    Speedup
  single     61.82ms    8.62ms     7.17x
  multi 2p   94.75ms    54.11ms    1.75x
  multi 3p   154.64ms   104.83ms   1.48x

Test 2: munmap latency for different sizes (zram, single process)

  Size     Base       Patched    Speedup
  64MB     14.11ms    2.18ms     6.47x
  128MB    29.45ms    4.48ms     6.57x
  192MB    43.85ms    6.62ms     6.62x
  256MB    57.01ms    9.08ms     6.28x
  512MB    115.13ms   55.58ms    2.07x
  1024MB   229.66ms   153.28ms   1.50x

Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)

  mode       Base       Patched    Speedup
  single     152.14ms   51.26ms    2.97x
  multi 2p   186.56ms   105.42ms   1.77x
  multi 3p   205.83ms   153.32ms   1.34x

Test 4: munmap latency for different sizes (zswap, single process)

  Size     Base       Patched    Speedup
  64MB     37.83ms    13.26ms    2.85x
  128MB    75.11ms    26.73ms    2.81x
  256MB    150.78ms   52.97ms    2.85x
  512MB    303.04ms   130.38ms   2.32x
  1024MB   599.95ms   287.10ms   2.09x

[1] https://lore.kernel.org/all/20240805153639.1057-1-justinjiang@xxxxxxxx/
[2] https://lore.kernel.org/all/20250909065349.574894-1-liulei.rjpt@xxxxxxxx/
[3] https://lore.kernel.org/linux-mm/20260412060450.15813-1-baohua@xxxxxxxxxx/

Changes since v2:
- Use per-cpu single-page buffers instead of a global list; the hot
  path only writes into the local CPU's buffer with preemption disabled
- Add a page pool for buffer rotation: when the current buffer is full,
  swap it with a free page from the pool and queue the full page for
  drain
- Introduce push/drain callback ops so that zram and zswap can each
  define their own element size and drain logic (zram stores u32 slot
  indices, zswap stores unsigned long handles)
- Drop the lock optimization patches; they will be submitted separately
  as part of a dedicated zsmalloc lock contention series
- Link to v2: https://lore.kernel.org/all/20260421121616.3298845-1-haowenchao@xxxxxxxxxx/

Barry Song (1):
  zram: use zsmalloc deferred free callback for async slot free

Wenchao Hao (3):
  mm/zsmalloc: introduce deferred free framework with callback ops
  mm/zswap: use zsmalloc deferred free callback for async invalidate
  zram: batch clear flags in slot_free with single write

 drivers/block/zram/zram_drv.c |  44 ++++++-
 drivers/block/zram/zram_drv.h |   6 +
 include/linux/zsmalloc.h      |  16 +++
 mm/zsmalloc.c                 | 208 +++++++++++++++++++++++++++++++++-
 mm/zswap.c                    |  38 ++++++-
 5 files changed, 306 insertions(+), 6 deletions(-)

--
2.34.1