Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Wenchao Hao

Date: Thu Apr 30 2026 - 03:39:54 EST

On Thu, Apr 30, 2026 at 6:44 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> > > How much of the benefit do we get with just these locking improvements
> > > without having to defer any of the freeing work?
> > >
> >
> > Hi Yosry,
> >
> > Thanks for the review. Great question — we tested exactly this.
> >
> > With only the class_idx-in-obj encoding (eliminating pool->lock from
> > zs_free, no deferred freeing), we measured on two platforms.
> >
> > Test: each process independently mmap 256MB, write data, madvise
> > MADV_PAGEOUT to swap out via zram (lzo-rle), then concurrent munmap.
> >
> > Raspberry Pi 4B (4-core ARM64 Cortex-A72):
> >
> > mode       Base      ClassIdx-only   Speedup
> > single     59.0ms    56.0ms          1.05x
> > multi 2p   94.6ms    66.7ms          1.42x
> > multi 4p   202.9ms   110.6ms         1.83x
> >
> > x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
> >
> > mode       Base      ClassIdx-only   Speedup
> > single     11.7ms    9.8ms           1.19x
> > multi 2p   24.1ms    17.2ms          1.40x
> > multi 4p   63.0ms    45.3ms          1.39x
> >
> > Single-process shows modest improvement. With multiple processes,
> > each read_lock/read_unlock atomically modifies the shared rwlock
> > reader count, and the cost of these atomic operations increases
> > with more CPUs accessing the same cacheline concurrently.
> > Eliminating pool->lock removes this overhead entirely.
> >
> > This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> > spare bits to fit class_idx. 32-bit systems don't have the room.
> > I'm still working on the compile-time gating to properly enable
> > this based on architecture and page size configuration.
> >
> > > As others have pointed out, I don't want to just defer expensive work
> > > without understanding why it's expensive and running into limitations
> > > about why it cannot be improved without deferring.
> >
> > For the deferred freeing part: the class_idx-in-obj optimization
> > addresses the multi-process scenario where concurrent atomic
> > operations on pool->lock become expensive, but does not help
> > single-process munmap. Deferred freeing moves the entire zs_free
> > cost (including class->lock and zspage freeing) off the munmap
> > hot path, which benefits even single-process workloads. The two
> > optimizations are complementary.
>
> What is the extra speedup added by the deferred freeing
> on top of the locking improvements?

The data I shared earlier was class_idx-in-obj only — no
deferred freeing at all.
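
For readers without the patch at hand, the idea of carrying class_idx
in spare bits of the encoded object value can be sketched in userspace
as below. This is only an illustration: the field widths, the field
order, and the sk_* names are hypothetical and are not the layout used
in the patch or in mm/zsmalloc.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical field widths, chosen only for this illustration. */
#define SK_OBJ_IDX_BITS    12u
#define SK_CLASS_IDX_BITS   8u
#define SK_PFN_BITS        (64u - SK_OBJ_IDX_BITS - SK_CLASS_IDX_BITS)

#define SK_MASK(bits)      ((UINT64_C(1) << (bits)) - 1)

/* Pack pfn | class_idx | obj_idx into one 64-bit value. */
static uint64_t sk_encode(uint64_t pfn, unsigned int class_idx,
                          unsigned int obj_idx)
{
    assert(pfn <= SK_MASK(SK_PFN_BITS));
    assert(class_idx <= SK_MASK(SK_CLASS_IDX_BITS));
    assert(obj_idx <= SK_MASK(SK_OBJ_IDX_BITS));

    return (pfn << (SK_CLASS_IDX_BITS + SK_OBJ_IDX_BITS)) |
           ((uint64_t)class_idx << SK_OBJ_IDX_BITS) |
           (uint64_t)obj_idx;
}

/* Recover class_idx from the value alone, with no shared state touched. */
static unsigned int sk_class_idx(uint64_t obj)
{
    return (unsigned int)((obj >> SK_OBJ_IDX_BITS) &
                          SK_MASK(SK_CLASS_IDX_BITS));
}

static unsigned int sk_obj_idx(uint64_t obj)
{
    return (unsigned int)(obj & SK_MASK(SK_OBJ_IDX_BITS));
}

static uint64_t sk_pfn(uint64_t obj)
{
    return obj >> (SK_CLASS_IDX_BITS + SK_OBJ_IDX_BITS);
}
```

The point of the sketch is that the free path can find the size class
from the value it already holds, which is why a 32-bit value does not
leave enough room for the extra field.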

> I couldn't immediately tell by looking at this vs. the cover letter. I wonder
> what portion of the improvement comes from the deferred freeing?

On top of that, we added deferred freeing in the zsmalloc
layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
drain worker). With both class_idx + deferred:

Test 1: concurrent munmap (256MB/process, RPi 4B):

mode       Base      Deferred   Speedup
single     56.2ms    17.2ms     3.27x
multi 3p   153.2ms   51.5ms     2.97x

Test 2: single process munmap (various sizes):

size    Base      Deferred   Speedup
64MB    15.0ms    4.3ms      3.47x
128MB   28.7ms    8.5ms      3.37x
192MB   43.2ms    13.0ms     3.32x
256MB   57.0ms    17.3ms     3.30x
512MB   114.4ms   38.5ms     2.97x

However, this is not the ceiling. Profiling with perf
shows that after deferred zs_free, zram_slot_free_notify
still accounts for ~65% of munmap time — mostly
slot_trylock/unlock and slot metadata operations.

To understand the theoretical limit, I tested an extreme
version that removes slot_trylock from the hot path
entirely (not safe for production, just benchmarking):

size    Base      Deferred   No-lock   Speedup (No-lock vs Base)
64MB    15.0ms    4.3ms      2.3ms     6.50x
128MB   28.7ms    8.5ms      4.7ms     6.14x
192MB   43.2ms    13.0ms     6.8ms     6.31x
256MB   57.0ms    17.3ms     9.0ms     6.30x
512MB   114.4ms   38.5ms     33.0ms    3.46x

I'm exploring ways to further reduce or eliminate the lock
from this path. Any suggestions on how to approach this
would be appreciated.

Unless otherwise noted, all data is from Raspberry Pi 4B
(4-core ARM64 Cortex-A72, 8GB RAM, zram 2GB, lzo-rle).
Test: mmap + fill + madvise(MADV_PAGEOUT) to swap out
via zram, then measure munmap time.
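
A minimal sketch of that test flow is below, for anyone who wants to
reproduce the shape of the measurement. It is not the actual test
program: the function names and the fill pattern are assumptions
(varying data is used so pages are not all identical), and the sketch
only warns if madvise(MADV_PAGEOUT) fails, in which case the timing
covers resident pages instead of swapped-out ones.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21 /* Linux >= 5.4, value from the uapi headers */
#endif

/* Assumed fill pattern: a simple LCG so page contents differ. */
static void fill_pattern(unsigned char *buf, size_t len)
{
    uint32_t x = 0x9e3779b9u;

    for (size_t i = 0; i < len; i++) {
        x = x * 1664525u + 1013904223u;
        buf[i] = (unsigned char)(x >> 24);
    }
}

/* Returns munmap() wall time in milliseconds, or -1.0 on failure. */
static double time_munmap_ms(size_t len)
{
    struct timespec t0, t1;
    unsigned char *buf;

    assert(len > 0);

    buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1.0;

    /* Write data so every page is populated. */
    fill_pattern(buf, len);

    /* Ask the kernel to reclaim the range (to zram, if it is the swap device). */
    if (madvise(buf, len, MADV_PAGEOUT) != 0)
        perror("madvise(MADV_PAGEOUT)");

    /* Time only the munmap() call. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (munmap(buf, len) != 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) * 1e3 +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e6;
}
```

Calling time_munmap_ms((size_t)256 << 20) corresponds to the 256MB
case. The multi-process runs would fork several such workers and
release them into munmap() together, which the sketch leaves out.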

Thanks,
Wenchao