Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Nhat Pham

Date: Tue May 19 2026 - 11:29:20 EST

On Sat, May 9, 2026 at 2:08 AM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> On Sat, May 9, 2026 at 7:33 AM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> >
> > On Wed, May 6, 2026 at 6:55 AM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > >
> > > On Sat, May 2, 2026 at 3:21 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
> > > >
> > > >
> > > > Oh man, you are eliminating pool lock here right? This would help my
> > > > other patch series a lot too :)
> > > >
> > > > https://lore.kernel.org/all/CAKEwX=M5YpR0cQrryX_y4pm_BuxyUWZ_8MbhWodwbf1Fe=gzew@xxxxxxxxxxxxxx/
> > > > https://lore.kernel.org/all/CAKEwX=PkFiP+u+ThrzjTKBi+usQf2uuhTZcfB2BNNA8RboOFDQ@xxxxxxxxxxxxxx/
> > > >
> > >
> > > Yes, exactly. With class_idx encoded in the obj value,
> > > zs_free() can determine the correct size_class without
> > > any pool-level lock. The lockless read gives a valid
> > > class_idx because it's invariant across migration (only
> > > PFN changes), and we re-read obj under class->lock to
> > > get a stable PFN.
> > >
> > > >
> > > > /*
> > > > * The pool->lock protects the race with zpage's migration
> > > > * so it's safe to get the page from handle.
> > > > */
> > > > read_lock(&pool->lock);
> > > > obj = handle_to_obj(handle);
> > > > obj_to_zpdesc(obj, &f_zpdesc);
> > > > zspage = get_zspage(f_zpdesc);
> > > > class = zspage_class(pool, zspage);
> > > > spin_lock(&class->lock);
> > > > read_unlock(&pool->lock);
> > > >
> > > > It's basically just this blob right?
> > > >
> > >
> > > Yes, that's the blob being replaced. On the
> > > ZS_OBJ_CLASS_IDX path (64-bit systems), it becomes:
> > >
> > > obj = handle_to_obj(handle);
> > > class = pool->size_class[obj_to_class_idx(obj)];
> > > spin_lock(&class->lock);
> > > obj = handle_to_obj(handle); /* re-read for stable PFN */
> > >
> > > No pool->lock at all. We've also added compile-time
> > > gating (#if BITS_PER_LONG >= 64) since 32-bit systems
> > > lack the spare bits in OBJ_INDEX to fit class_idx. On
> > > 32-bit, it falls back to the original pool->lock path.
> > >
> >
> > BTW, I've tested your idea with a hacky prototype, when I was playing
> > with my vswap series. It absolutely improves free time in the usemem
> > benchmark :) Idea is very promising - I won't scoop your work of
> > course, just letting you know that at least in my use case, it works
> > :) Look forward to seeing it submitted soon!!!
>
> Thanks, Nhat, that's great to hear.
>
> I've split this part out and posted it as its own series:
>
> https://lore.kernel.org/linux-mm/20260508061910.3882831-1-haowenchao@xxxxxxxxxx
>
> Review there would be very welcome.

Huh, I think I might have been unsubscribed from linux-mm again -.-
Weird - I wonder if this is because of Gmail shenanigans.

Can you cc me on the thread next time, just in case?

>
> Also, could you share the details of your usemem setup? I'd like
> to reproduce it locally on the same baseline.

Sure! I left some notes here:

https://lore.kernel.org/all/20260505153854.1612033-1-nphamcs@xxxxxxxxx/

But for your convenience, this is the benchmark I ran:

Usemem, single-threaded: anonymous memory allocation (56GB) on a host
with 32GB RAM, 16 rounds.

I don't set a memory limit on the cgroup; the run relies on global
memory pressure instead (per Kairui's instructions).

I'm not on my work server right now so I don't have the exact command,
but hopefully that should be enough to show the wins with your patch
series! I wanted to run it for your patch series myself but I do not
have the cycles right now, unfortunately :(

>
> Thanks,
> Wenchao