Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path
From: Nhat Pham
Date: Sat May 02 2026 - 03:21:39 EST
On Tue, Apr 28, 2026 at 2:51 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > >
> > > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > Hi Nhat, Kairui, Barry, Xueyuan,
> > >
> > > Thanks for the review. I agree with the direction and have some ideas for
> > > an alternative approach.
> > >
> > > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > > free to per-cpu buffers with a lockless handoff, and finally reduce
> > > class->lock overhead during drain by exploiting natural class locality.
> > > Achieving both per-cpu and per-class granularity is difficult, so the
> > > class->lock optimization is a compromise, but one that works well in
> > > practice.
> > >
> > > 1. Encode class_idx in obj to eliminate pool->lock
> > >
> > > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > > for obj_idx, leaving 14 spare bits.
> > > We can split OBJ_INDEX into class_idx + obj_idx:
> > >
> > > obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> > >
> > > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > > (8 bits for 4K pages, 9 for 64K).
> > > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > > stable PFN. No pool->lock needed.
> >
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question — we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmaps 256MB, writes data, calls
> madvise(MADV_PAGEOUT) to swap out via zram (lzo-rle), then munmaps
> concurrently.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
> mode       Base      ClassIdx-only  Speedup
> single     59.0ms    56.0ms         1.05x
> multi 2p   94.6ms    66.7ms         1.42x
> multi 4p   202.9ms   110.6ms        1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
> mode       Base      ClassIdx-only  Speedup
> single     11.7ms    9.8ms          1.19x
> multi 2p   24.1ms    17.2ms         1.40x
> multi 4p   63.0ms    45.3ms         1.39x
Oh man, you are eliminating pool lock here right? This would help my
other patch series a lot too :)
https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@xxxxxxxxxxxxxx/
https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@xxxxxxxxxxxxxx/
Well, the deferred freeing would completely move that contention out
of the way lol. But this would benefit all users, regardless of
whether we're deferring the free step or not (for instance, this will
reduce contention between page fault and compaction, IIUC?) I feel
like you'll get some good numbers testing in a system with compaction
and THP enabled, with lots of swap activity. Which is... a lot of
server setups :)
If the deferred freeing is too controversial, this smells like
something that should be upstreamed independently.
>
> Single-process shows modest improvement. With multiple processes,
> each read_lock/read_unlock atomically modifies the shared rwlock
> reader count, and the cost of these atomic operations increases
> with more CPUs accessing the same cacheline concurrently.
> Eliminating pool->lock removes this overhead entirely.
>
> This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> spare bits to fit class_idx. 32-bit systems don't have the room.
> I'm still working on the compile-time gating to properly enable
> this based on architecture and page size configuration.
/*
* The pool->lock protects the race with zpage's migration
* so it's safe to get the page from handle.
*/
read_lock(&pool->lock);
obj = handle_to_obj(handle);
obj_to_zpdesc(obj, &f_zpdesc);
zspage = get_zspage(f_zpdesc);
class = zspage_class(pool, zspage);
spin_lock(&class->lock);
read_unlock(&pool->lock);
It's basically just this blob right?
>
> > As others have pointed out, I don't want to just defer expensive work
> > without understanding why it's expensive and running into limitations
> > about why it cannot be improved without deferring.
>
> For the deferred freeing part: the class_idx-in-obj optimization
> addresses the multi-process scenario where concurrent atomic
> operations on pool->lock become expensive, but does not help
> single-process munmap. Deferred freeing moves the entire zs_free
> cost (including class->lock and zspage freeing) off the munmap
> hot path, which benefits even single-process workloads. The two
> optimizations are complementary.
+1 :)