Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Wenchao Hao

Date: Wed May 06 2026 - 09:58:29 EST


On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>
> > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > zs_free, no deferred freeing), we measured on two platforms.
> >
> > Test: each process independently mmap 256MB, write data, madvise
> > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> >
> > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> >
> > mode Base ClassIdx-only Speedup
> > single 59.0ms 56.0ms 1.05x
> > multi 2p 94.6ms 66.7ms 1.42x
> > multi 4p 202.9ms 110.6ms 1.83x
> >
> > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> >
> > mode Base ClassIdx-only Speedup
> > single 11.7ms 9.8ms 1.19x
> > multi 2p 24.1ms 17.2ms 1.40x
> > multi 4p 63.0ms 45.3ms 1.39x
>
> Oh man, you are eliminating pool lock here right? This would help my
> other patch series a lot too :)
>
> https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@xxxxxxxxxxxxxx/
> https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@xxxxxxxxxxxxxx/
>

Yes, exactly. With class_idx encoded in the obj value,
zs_free() can determine the correct size_class without
taking any pool-level lock. The lockless read yields a
valid class_idx because it is invariant across migration
(only the PFN changes), and we re-read obj under
class->lock to get a stable PFN.

> Well, the deferred freeing would completely move that contention out
> of the way lol. But this would benefit all users, regardless of
> whether we're deferring the free step or not (for instance, this will
> reduce contention between page fault and compaction, IIUC?) I feel
> like you'll get some good numbers testing in a system with compaction
> and THP enabled, with lots of swap activities. Which is... a lot of
> server setup :)
>
> If the deferred freeing is too controversial, this smells like
> something that should be upstreamed independently.
>

Agreed. We're planning to split the series so that the
class_idx encoding + pool->lock elimination can be
reviewed and merged independently of the deferred free
framework. It's a pure win with no behavioral change,
just less lock contention.

> >
> > Single-process shows modest improvement. With multiple processes,
> > each read_lock/read_unlock atomically modifies the shared rwlock
> > reader count, and the cost of these atomic operations increases
> > with more CPUs accessing the same cacheline concurrently.
> > Eliminating pool->lock removes this overhead entirely.
> >
> > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > spare bits to fit class_idx. 32-bit systems don't have the room.
> > I'm still working on the compile-time gating to properly enable
> > this based on architecture and page size configuration.
>
> /*
> * The pool->lock protects the race with zpage's migration
> * so it's safe to get the page from handle.
> */
> read_lock(&pool->lock);
> obj = handle_to_obj(handle);
> obj_to_zpdesc(obj, &f_zpdesc);
> zspage = get_zspage(f_zpdesc);
> class = zspage_class(pool, zspage);
> spin_lock(&class->lock);
> read_unlock(&pool->lock);
>
> It's basically just this blob right?
>

Yes, that's the blob being replaced. On the
ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:

obj = handle_to_obj(handle);
class = pool->size_class[obj_to_class_idx(obj)];
spin_lock(&class->lock);
obj = handle_to_obj(handle); /* re-read for stable PFN */

No pool->lock at all. We've also added compile-time
gating (#if BITS_PER_LONG >= 64) since 32-bit systems
lack the spare bits in OBJ_INDEX to fit class_idx. On
32-bit, it falls back to the original pool->lock path.

> >
> > > As others have pointed out, I don't want to just defer expensive work
> > > without understanding why it's expensive and running into limitations
> > > about why it cannot be improved without deferring.
> >
> > For the deferred freeing part: the class_idx-in-obj optimization
> > addresses the multi-process scenario where concurrent atomic
> > operations on pool->lock become expensive, but does not help
> > single-process munmap. Deferred freeing moves the entire zs_free
> > cost (including class->lock and zspage freeing) off the munmap
> > hot path, which benefits even single-process workloads. The two
> > optimizations are complementary.
>
> +1 :)