Re: [RFC PATCH v3 0/4] mm/zsmalloc: per-cpu deferred free to accelerate swap entry release

From: Yosry Ahmed

Date: Mon May 11 2026 - 20:01:35 EST


On Sat, May 09, 2026 at 04:32:04PM +0800, Wenchao Hao wrote:
> On Sat, May 9, 2026 at 4:13 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > On Thu, May 7, 2026 at 11:08 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing many swap
> > > entries. This has been reported to significantly delay memory reclamation
> > > during Android's low-memory killing, especially when multiple processes
> > > are terminated to free memory, with slot_free() accounting for more than
> > > 80% of the total cost of freeing swap entries.
> > >
> > > This series introduces a callback-based deferred free framework in
> > > zsmalloc. Callers (zram, zswap) register push/drain callbacks to
> > > define what gets buffered and how it gets drained. The entire free
> > > path including caller-side bookkeeping (slot_free, zswap_entry_free)
> > > is deferred to a background worker.
> >
> > How much of the speedup comes from avoiding the per-class lock,
> > free_zspage(), other work in zswap, etc.
>
> This series doesn't avoid the per-class lock. The pool->lock part
> has been split out and posted as a separate series, so this series
> focuses purely on the defer scheme:
>
> https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xxxxxxxxxx/
>
> >
> > I ask because I think the design here is still fairly complex. I don't
> > like how zswap and zram are registering callbacks into zsmalloc to do
> > their own freeing work, and they fill the buffers on behalf of
> > zsmalloc which seems like a layering violation.
>
> The callback design was motivated by code reuse -- deferring only
> zs_free() inside zsmalloc gave less speedup, and the machinery
> needed to defer caller-side bookkeeping turns out to be the same
> on both sides (per-cpu page buffer, drain worker, fallback). So I
> folded the common parts into zsmalloc.
>
> I agree it's not clean from a layering standpoint, and I'm happy to
> revisit if the reuse isn't worth the cost.
>
> >
> > I wonder how much of the speedup we get by just deferring
> > free_zspage()?
>
> Below is the perf breakdown, sampled only during munmap() of a
> 256MB zram-filled VMA on a Raspberry Pi 4B.
>
> Base kernel:
>
> # Samples: 491 of event 'cycles'
> # Event count (approx.): 214056923
> #
> # Children Self Symbol
> # ........ ........ ..........................................
> 99.55% 0.41% [k] __zap_vma_range
> 97.27% 2.91% [k] swap_put_entries_cluster
> 94.37% 1.65% [k] __swap_cluster_free_entries
> 88.99% 8.91% [k] zram_slot_free_notify
> 79.87% 10.78% [k] slot_free
> 56.27% 5.99% [k] zs_free
> 47.61% 4.35% [k] free_zspage

Seems like most of the zsmalloc overhead comes from free_zspage(),
right? I think we can significantly simplify things if we only defer
that part. Instead of having a page pool and buffers where we store the
handles for async free, we can just remove the zspage from the
fullness list and put it on a deferred freeing list.

We can probably even explore not doing per-CPU and just use a single
global worker with a single lockless list (llist), then the worker can
just do llist_del_all() to atomically empty the list and process it
locally. If that turns out to be expensive we can do per-CPU lists.

WDYT? I think this can simplify things significantly.

> 36.85% 4.96% [k] __free_zspage
> 19.27% 0.21% [k] __folio_put
> 12.64% 2.91% [k] __free_frozen_pages
> 9.50% 6.40% [k] kmem_cache_free
> 8.28% 8.28% [k] _raw_spin_unlock_irqrestore
> 6.83% 1.85% [k] dec_zone_page_state
> 5.18% 5.18% [k] _raw_spin_unlock
> 5.18% 5.18% [k] folio_unlock
> 4.98% 4.98% [k] mod_zone_state
> 4.12% 4.12% [k] _raw_spin_lock
> 3.30% 3.30% [k] __swap_cgroup_id_xchg
>
> Perf of the zsmalloc-only variant (same 256MB zram workload):
>
> My first attempt for this RFC was exactly that -- defer only the
> handle free inside zsmalloc, keep zram/zswap caller-side bookkeeping
> synchronous. (I would post this version after this thread)
[..]