Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Nhat Pham

Date: Fri May 08 2026 - 19:37:24 EST

On Wed, May 6, 2026 at 6:55 AM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > > zs_free, no deferred freeing), we measured on two platforms.
> > >
> > > Test: each process independently mmap 256MB, write data, madvise
> > > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> > >
> > > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> > >
> > > mode      Base     ClassIdx-only  Speedup
> > > single    59.0ms   56.0ms         1.05x
> > > multi 2p  94.6ms   66.7ms         1.42x
> > > multi 4p  202.9ms  110.6ms        1.83x
> > >
> > > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> > >
> > > mode      Base     ClassIdx-only  Speedup
> > > single    11.7ms   9.8ms          1.19x
> > > multi 2p  24.1ms   17.2ms         1.40x
> > > multi 4p  63.0ms   45.3ms         1.39x
> >
> > Oh man, you are eliminating pool lock here right? This would help my
> > other patch series a lot too :)
> >
> > https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@xxxxxxxxxxxxxx/
> > https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@xxxxxxxxxxxxxx/
> >
>
> Yes, exactly. With class_idx encoded in the obj value,
> zs_free() can determine the correct size_class without
> any pool-level lock. The lockless read gives a valid
> class_idx because it's invariant across migration (only
> PFN changes), and we re-read obj under class->lock to
> get a stable PFN.
>
> > Well, the deferred freeing would completely move that contention out
> > of the way lol. But this would benefit all users, regardless of
> > whether we're deferring the free step or not (for instance, this will
> > reduce contention between page fault and compaction, IIUC?) I feel
> > like you'll get some good numbers testing in a system with compaction
> > and THP enabled, with lots of swap activities. Which is... a lot of
> > server setup :)
> >
> > If the deferred freeing is too controversial, this smells like
> > something that should be upstreamed independently.
> >
>
> Agreed. We're planning to split the series so that the
> class_idx encoding + pool->lock elimination can be
> reviewed and merged independently of the deferred free
> framework. It's a pure win with no behavioral change
> — just less lock contention.
>
> > >
> > > Single-process shows modest improvement. With multiple processes,
> > > each read_lock/read_unlock atomically modifies the shared rwlock
> > > reader count, and the cost of these atomic operations increases
> > > with more CPUs accessing the same cacheline concurrently.
> > > Eliminating pool->lock removes this overhead entirely.
> > >
> > > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > > spare bits to fit class_idx. 32-bit systems don't have the room.
> > > I'm still working on the compile-time gating to properly enable
> > > this based on architecture and page size configuration.
> >
> > /*
> > * The pool->lock protects the race with zpage's migration
> > * so it's safe to get the page from handle.
> > */
> > read_lock(&pool->lock);
> > obj = handle_to_obj(handle);
> > obj_to_zpdesc(obj, &f_zpdesc);
> > zspage = get_zspage(f_zpdesc);
> > class = zspage_class(pool, zspage);
> > spin_lock(&class->lock);
> > read_unlock(&pool->lock);
> >
> > It's basically just this blob right?
> >
>
> Yes, that's the blob being replaced. On the
> ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:
>
> obj = handle_to_obj(handle);
> class = pool->size_class[obj_to_class_idx(obj)];
> spin_lock(&class->lock);
> obj = handle_to_obj(handle); /* re-read for stable PFN */
>
> No pool->lock at all. We've also added compile-time
> gating (#if BITS_PER_LONG >= 64) since 32-bit systems
> lack the spare bits in OBJ_INDEX to fit class_idx. On
> 32-bit, it falls back to the original pool->lock path.
>

BTW, I've tested your idea with a hacky prototype while I was playing
with my vswap series. It absolutely improves free time in the usemem
benchmark :) The idea is very promising. I won't scoop your work of
course, just letting you know that at least in my use case, it works
:) Looking forward to seeing it submitted soon!