Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release

From: Nhat Pham

Date: Fri May 08 2026 - 20:08:26 EST

On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing many swap
> entries. This has been reported to significantly delay memory reclamation
> during Android's low-memory killing, especially when multiple processes
> are terminated to free memory, with slot_free() accounting for more than
> 80% of the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the design
> itself is fairly complex.
>
> When anon folios and swap entries are mixed within a process, reclaiming
> anon folios from killed processes helps return memory to the system as
> quickly as possible, so that newly launched applications can satisfy
> their memory demands. It is not ideal for swap freeing to block anon
> folio freeing. On the other hand, swap freeing can still return memory
> to the system, although at a slower rate due to memory compression.
>
> This series introduces a callback-based deferred free framework in
> zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> define what gets buffered and how it gets drained. The entire free
> path including caller-side bookkeeping (slot_free, zswap_entry_free)
> is deferred to a background worker.
>
> Implementation:
> - Each CPU owns a single-page buffer. The hot path writes a value
>   via the push callback with preemption disabled (no locks).
> - When the buffer fills, it is swapped with a fresh page from a
>   pre-allocated page pool. The full page is queued to a WQ_UNBOUND
>   worker for drain.
> - The drain callback performs the actual expensive work (zs_free,
>   slot_free, zswap_entry_free, etc.) in batch, off the hot path.
> - If no free page is available, the caller falls back to synchronous
>   processing.
>
> The speedup comes from moving expensive swap slot freeing off the
> munmap hot path into a background worker, so that intact anonymous
> folios are released back to the system without blocking. The worker
> drains at a slower rate since compressed objects are small and freeing
> a single handle may not release an entire page until the zspage is
> fully empty.
>
> Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):
>
> Test 1: munmap latency for 256MB swap-filled VMA (zram backend)
>
> mode        Base       Patched    Speedup
> single      61.82ms    8.62ms     7.17x
> multi 2p    94.75ms    54.11ms    1.75x
> multi 3p    154.64ms   104.83ms   1.48x
>
> Test 2: munmap latency for different sizes (zram, single process)
>
> Size        Base       Patched    Speedup
> 64MB        14.11ms    2.18ms     6.47x
> 128MB       29.45ms    4.48ms     6.57x
> 192MB       43.85ms    6.62ms     6.62x
> 256MB       57.01ms    9.08ms     6.28x
> 512MB       115.13ms   55.58ms    2.07x
> 1024MB      229.66ms   153.28ms   1.50x
>
> Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)
>
> mode        Base       Patched    Speedup
> single      152.14ms   51.26ms    2.97x
> multi 2p    186.56ms   105.42ms   1.77x
> multi 3p    205.83ms   153.32ms   1.34x
>
> Test 4: munmap latency for different sizes (zswap, single process)
>
> Size        Base       Patched    Speedup
> 64MB        37.83ms    13.26ms    2.85x
> 128MB       75.11ms    26.73ms    2.81x
> 256MB       150.78ms   52.97ms    2.85x
> 512MB       303.04ms   130.38ms   2.32x
> 1024MB      599.95ms   287.10ms   2.09x
>

Hmmm, why are we batching at the zswap/zsmalloc level like this? I
agree with Yosry that this seems like a somewhat unnecessary layering
violation. For example, do we see a much bigger performance win from
this than from simply doing:

static void zswap_entry_free(swp_entry_t swp, bool deferred)
{
	...
	if (!deferred || !zs_deferred_free(entry->pool->zs_pool, entry->handle))
		zs_free(entry->pool->zs_pool, entry->handle);
}

(basically what you had in the last version).
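
To make the control flow I have in mind concrete, here is a tiny
userspace model of it. This is only a sketch, not kernel code: every
model_* name is made up for illustration, a 4-slot array stands in for
the per-CPU page, and the drain step is called by hand instead of from
a workqueue. The point is just that the caller does its own
bookkeeping synchronously, and only the allocator-level free is
buffered, with a synchronous fallback when the buffer is full.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_BUF_SLOTS 4	/* tiny on purpose; stands in for one page */

/* Hypothetical stand-in for a zsmalloc pool with one deferred-free buffer. */
struct model_pool {
	unsigned long buf[MODEL_BUF_SLOTS];	/* handles waiting for drain */
	size_t nr_buffered;
	size_t nr_freed_sync;	/* freed inline by the caller */
	size_t nr_freed_async;	/* freed later by the drain step */
};

/* Stand-in for zs_free(): the expensive, synchronous free. */
static void model_free(struct model_pool *pool, unsigned long handle)
{
	(void)handle;
	pool->nr_freed_sync++;
}

/*
 * Stand-in for zs_deferred_free(): buffer the handle if there is room.
 * Returns false when the buffer is full so the caller can fall back.
 */
static bool model_deferred_free(struct model_pool *pool, unsigned long handle)
{
	if (pool->nr_buffered == MODEL_BUF_SLOTS)
		return false;
	pool->buf[pool->nr_buffered++] = handle;
	return true;
}

/* Stand-in for the background worker: free everything that was buffered. */
static void model_drain(struct model_pool *pool)
{
	assert(pool->nr_buffered <= MODEL_BUF_SLOTS);
	pool->nr_freed_async += pool->nr_buffered;
	pool->nr_buffered = 0;
}

/*
 * Caller side, mirroring the snippet above. In the real thing the
 * tree/LRU removal and stats updates would happen here, synchronously,
 * before the handle is handed to the allocator.
 */
static void model_entry_free(struct model_pool *pool, unsigned long handle,
			     bool deferred)
{
	if (!deferred || !model_deferred_free(pool, handle))
		model_free(pool, handle);
}
```

With this shape the upper layer never has entries that are half-freed;
the only state that lingers until drain is a bare handle owned by the
allocator.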

One weird effect of deferring the whole zswap entry free the way you
are proposing here is that the zswap LRU will be littered with stale
zswap entries. It looks like you remove them from the zswap xarray,
but they're still linked into the zswap LRU? At writeback time, that
will throw off the statistics used in the heuristics, and will make
writeback walk through a bunch of stale entries, wasting more cycles :)
Seems a bit inelegant, no?
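
A toy illustration of the stale-entry concern, again as a userspace
sketch with made-up names (model_lru_entry, model_writeback_scan) and
a flat array standing in for the LRU list. It assumes a stale entry is
one whose free is still pending in the deferred buffer, and it only
counts how many entries a walk has to touch to find a given number of
live ones. It does not model the real shrinker heuristics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical LRU entry: only tracks whether its free is still pending. */
struct model_lru_entry {
	bool stale;	/* dropped from the tree, but still on the LRU */
};

/*
 * Walk the LRU from the cold end until 'want' live entries are found
 * or the list is exhausted. Returns the number of entries scanned and
 * stores the number of live entries found in *found.
 */
static size_t model_writeback_scan(const struct model_lru_entry *lru,
				   size_t nr, size_t want, size_t *found)
{
	size_t scanned = 0;

	*found = 0;
	while (scanned < nr && *found < want) {
		if (!lru[scanned].stale)
			(*found)++;
		scanned++;
	}
	assert(*found <= scanned);
	return scanned;
}
```

If the cold end of the list is mostly stale entries, the scan count
grows while the number of useful writeback candidates stays the same,
and any accounting that treats the LRU length as "entries that could
be written back" overestimates.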