Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Nhat Pham

Date: Tue Apr 21 2026 - 11:55:13 EST

On Tue, Apr 21, 2026 at 5:16 AM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> Swap freeing can be expensive when unmapping a VMA containing
> many swap entries. This has been reported to significantly
> delay memory reclamation during Android's low-memory killing,
> especially when multiple processes are terminated to free
> memory, with slot_free() accounting for more than 80% of
> the total cost of freeing swap entries.
>
> Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> to asynchronously collect and free swap entries [1][2], but the
> design itself is fairly complex.
>
> When anon folios and swap entries are mixed within a
> process, reclaiming anon folios from killed processes
> helps return memory to the system as quickly as possible,
> so that newly launched applications can satisfy their
> memory demands. It is not ideal for swap freeing to block
> anon folio freeing. On the other hand, swap freeing can
> still return memory to the system, although at a slower
> rate due to memory compression.

Is this correct? I don't think we do decompression in the
zswap_invalidate() path. We do decompression in zswap_load(), but as a
separate step from zswap_invalidate().

zswap/zsmalloc entry freeing is decoupled from decompression. For
example, on process teardown, we free the zsmalloc memory but never
decompress (if we do then it's a bug to be fixed lol, but I doubt it).

Freeing zsmalloc memory might give less bang for the buck than freeing
anon folios, but if it's "expensive", then I think that points to a
different root cause: zsmalloc's poor scalability in the free path.

I've stared at this code path for a bit, because my other patch series
(vswap - see [1]) was reported to regress on the free path in the
usemem benchmark. One of the issues was contention between zs_free and
compaction (both systemwide compaction, i.e. zs_page_migrate, and
zsmalloc's internal compaction, but mostly the former):

* zs_free read-acquires pool->lock, and compaction write-acquires the
same lock. So the compaction thread will make all zs free-ers wait for
it. I saw this read lock delay when I perfed the free step of usemem.

* If this lock has fair queueing semantics (I have not checked), then
once a compaction is queued behind a bunch of zs_free callers, all the
subsequent zs_free-ers are blocked too :)

* I'm also curious about the cache-friendliness of this rwlock, which
bounces across CPUs if you have multiple processes being torn down
concurrently.
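
To make the first bullet concrete, here is a minimal userspace sketch
of the pattern. It is a model only, not zsmalloc code: a
pthread_rwlock_t stands in for pool->lock (the kernel uses its own
rwlock type), and model_zs_free(), frees_during_compaction() and
MAX_FREERS are hypothetical names made up for illustration. It shows
that no reader-side "free" completes while the writer holds the lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <time.h>

#define MAX_FREERS 64

/* Userspace stand-in for pool->lock (assumption: not the kernel type). */
static pthread_rwlock_t pool_lock = PTHREAD_RWLOCK_INITIALIZER;
static atomic_int frees_done;

/* Models one zs_free() caller: read-acquire, do the "free", release. */
static void *model_zs_free(void *arg)
{
	(void)arg;
	pthread_rwlock_rdlock(&pool_lock);
	atomic_fetch_add(&frees_done, 1);
	pthread_rwlock_unlock(&pool_lock);
	return NULL;
}

/*
 * Models compaction holding the lock for write while nr_freers threads
 * try to free. Returns how many frees completed while the write lock
 * was held; *total receives the count after the writer dropped it.
 */
static int frees_during_compaction(int nr_freers, int *total)
{
	pthread_t tids[MAX_FREERS];
	struct timespec hold = { .tv_sec = 0, .tv_nsec = 20 * 1000 * 1000 };
	int i, rc, during;

	assert(nr_freers >= 0 && nr_freers <= MAX_FREERS);
	atomic_store(&frees_done, 0);

	pthread_rwlock_wrlock(&pool_lock);	/* "compaction" starts */
	for (i = 0; i < nr_freers; i++) {
		rc = pthread_create(&tids[i], NULL, model_zs_free, NULL);
		assert(rc == 0);
		(void)rc;
	}
	nanosleep(&hold, NULL);			/* freers pile up on the lock */
	during = atomic_load(&frees_done);	/* readers are still excluded */
	pthread_rwlock_unlock(&pool_lock);	/* "compaction" ends */

	for (i = 0; i < nr_freers; i++)
		pthread_join(tids[i], NULL);
	*total = atomic_load(&frees_done);
	return during;
}
```

Calling frees_during_compaction(8, &total) from a main() built with
-pthread should report zero frees during the hold and all eight
afterwards. The 20ms hold is arbitrary; it only gives the freers time
to queue up behind the writer.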

Have you perf-ed process teardown yet? Can I ask you for a perf trace
on this part? I'm not against async zs-freeing (might still be
required after all), but if it's something fixable on the zsmalloc
side, we should probably prioritize that :) Otherwise these swap
freeing workers will exhibit the same poor scalability - we might be
better off because we manage to get rid of bigger chunks of
uncompressed memory first, but we will still be slow in releasing
the system's and cgroup's (in zswap's case) compressed memory.
I'd love to hear thoughts from Yosry, Johannes, Sergey and Minchan
too.