Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release

From: Wenchao Hao

Date: Sat May 09 2026 - 04:46:03 EST

On Sat, May 9, 2026 at 8:08 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> >
> > Swap freeing can be expensive when unmapping a VMA containing many swap
> > entries. This has been reported to significantly delay memory reclamation
> > during Android's low-memory killing, especially when multiple processes
> > are terminated to free memory, with slot_free() accounting for more than
> > 80% of the total cost of freeing swap entries.
> >
> > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > to asynchronously collect and free swap entries [1][2], but the design
> > itself is fairly complex.
> >
> > When anon folios and swap entries are mixed within a process, reclaiming
> > anon folios from killed processes helps return memory to the system as
> > quickly as possible, so that newly launched applications can satisfy
> > their memory demands. It is not ideal for swap freeing to block anon
> > folio freeing. On the other hand, swap freeing can still return memory
> > to the system, although at a slower rate due to memory compression.
> >
> > This series introduces a callback-based deferred free framework in
> > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > define what gets buffered and how it gets drained. The entire free
> > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > is deferred to a background worker.
> >
> > Implementation:
> > - Each CPU owns a single-page buffer. The hot path writes a value
> >   via the push callback with preemption disabled (no locks).
> > - When the buffer fills, it is swapped with a fresh page from a
> >   pre-allocated page pool. The full page is queued to a WQ_UNBOUND
> >   worker for drain.
> > - The drain callback performs the actual expensive work (zs_free,
> >   slot_free, zswap_entry_free, etc.) in batch, off the hot path.
> > - If no free page is available, the caller falls back to synchronous
> >   processing.
> >
> > The speedup comes from moving expensive swap slot freeing off the
> > munmap hot path into a background worker, so that intact anonymous
> > folios are released back to the system without blocking. The worker
> > drains at a slower rate since compressed objects are small and freeing
> > a single handle may not release an entire page until the zspage is
> > fully empty.
> >
> > Performance results (Raspberry Pi 4B, ARM64, 8GB RAM):
> >
> > Test 1: munmap latency for 256MB swap-filled VMA (zram backend)
> >
> >   mode       Base       Patched    Speedup
> >   single     61.82ms    8.62ms     7.17x
> >   multi 2p   94.75ms    54.11ms    1.75x
> >   multi 3p   154.64ms   104.83ms   1.48x
> >
> > Test 2: munmap latency for different sizes (zram, single process)
> >
> >   Size       Base       Patched    Speedup
> >   64MB       14.11ms    2.18ms     6.47x
> >   128MB      29.45ms    4.48ms     6.57x
> >   192MB      43.85ms    6.62ms     6.62x
> >   256MB      57.01ms    9.08ms     6.28x
> >   512MB      115.13ms   55.58ms    2.07x
> >   1024MB     229.66ms   153.28ms   1.50x
> >
> > Test 3: munmap latency for 256MB swap-filled VMA (zswap backend)
> >
> >   mode       Base       Patched    Speedup
> >   single     152.14ms   51.26ms    2.97x
> >   multi 2p   186.56ms   105.42ms   1.77x
> >   multi 3p   205.83ms   153.32ms   1.34x
> >
> > Test 4: munmap latency for different sizes (zswap, single process)
> >
> >   Size       Base       Patched    Speedup
> >   64MB       37.83ms    13.26ms    2.85x
> >   128MB      75.11ms    26.73ms    2.81x
> >   256MB      150.78ms   52.97ms    2.85x
> >   512MB      303.04ms   130.38ms   2.32x
> >   1024MB     599.95ms   287.10ms   2.09x
> >
>
> Hmmm, why are we batching at the zswap/zsmalloc level like this? I
> agree with Yosry that this seems like somewhat of an unnecessary
> layering violation. For example, do we observe a lot more performance
> wins by doing this instead of just simply:
>

Thanks for the reply. Please refer to the following thread for the
perf breakdown and detailed data:

https://lore.kernel.org/linux-mm/CAOptpSPY3YL5VFJW9KKP99Yb17+_rdXKsKj93FdEn3_Zb350ow@xxxxxxxxxxxxxx/

> static void zswap_entry_free(swp_entry_t swp, bool deferred)
> {
>         ...
>         if (!deferred || !zs_deferred_free(entry->pool->zs_pool, entry->handle))
>                 zs_free(entry->pool->zs_pool, entry->handle);
> }
>
> (basically what you had in the last version).
>
> One weird effect of doing deferred zswap entry freeing like what you
> are proposing here, is that the zswap LRU will be littered with stale
> zswap entries. Seems like you removed them from the zswap xarray, but
> they're still linked into the zswap LRU? At writeback time, that will
> throw off the statistics used in the heuristics, and will make
> writeback go through a bunch of stale entries, wasting more cycles :)
> Seems a bit inelegant, no?

You're right, that was an oversight -- thanks for pointing it out. The
zsmalloc-only variant avoids this entirely: zswap_lru_del() stays
synchronous before the handle is queued, so the LRU never contains
torn-down entries. I'll make sure v4 doesn't have this issue
regardless of which direction we go.
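
To make the ordering concrete, here is a rough user-space model (not
kernel code, and not taken from the series). Everything prefixed mock_,
the counters, and the DEFER_SLOTS size are placeholders made up for
illustration; the only thing it is meant to show is the ordering
described above: LRU removal happens synchronously first, then the
handle is queued, with a synchronous fallback when the buffer is full.

```c
/* User-space model of the ordering only; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFER_SLOTS 4	/* tiny on purpose; the real buffer is one page */

struct mock_entry {
	unsigned long handle;
	bool on_lru;
};

struct defer_buf {
	unsigned long handles[DEFER_SLOTS];
	size_t count;
};

unsigned long nr_sync_free;	/* handles freed directly by the caller */
unsigned long nr_deferred_free;	/* handles freed later by the drain */

/* Stand-in for the LRU unlink; in this model it only clears a flag. */
void mock_lru_del(struct mock_entry *e)
{
	e->on_lru = false;
}

/* Stand-in for a deferred free: returns false when the buffer is full. */
bool mock_deferred_free(struct defer_buf *buf, unsigned long handle)
{
	if (buf->count == DEFER_SLOTS)
		return false;
	buf->handles[buf->count++] = handle;
	return true;
}

/* Entry teardown: LRU removal is never deferred, only the handle free is. */
void mock_entry_free(struct defer_buf *buf, struct mock_entry *e, bool deferred)
{
	mock_lru_del(e);
	assert(!e->on_lru);	/* entry is off the LRU before queueing */

	if (!deferred || !mock_deferred_free(buf, e->handle))
		nr_sync_free++;	/* fallback: free in the caller's context */
}

/* Stand-in for the background drain: frees everything queued so far. */
size_t mock_drain(struct defer_buf *buf)
{
	size_t n = buf->count;

	nr_deferred_free += n;
	buf->count = 0;
	return n;
}
```

With DEFER_SLOTS set to 4, freeing six entries in deferred mode queues
four handles and frees the remaining two synchronously, and each entry
is already off the LRU by the time its handle is queued.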