Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Wenchao Hao

Date: Thu Apr 30 2026 - 11:30:27 EST

On Thu, Apr 30, 2026 at 4:00 PM Kairui Song <ryncsn@xxxxxxxxx> wrote:
>
> On Thu, Apr 30, 2026 at 3:43 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > The data I shared earlier was class_idx-in-obj only — no
> > deferred freeing at all.
> >
> > > I couldn't immediately tell by looking at this vs. the cover letter. I wonder
> > > what portion of the improvement comes from the deferred freeing?
> >
> > On top of that, we added deferred freeing in the zsmalloc
> > layer (per-cpu page-pool based buffer swap + WQ_UNBOUND
> > drain worker). With both class_idx + deferred:
> >
> > Test 1: concurrent munmap (256MB/process, RPi 4B):
> >
> > mode       Base      Deferred   Speedup
> > single     56.2ms    17.2ms     3.27x
> > multi 3p   153.2ms   51.5ms     2.97x
> >
> > Test 2: single process munmap (various sizes):
> >
> > size    Base      Deferred   Speedup
> > 64MB    15.0ms    4.3ms      3.47x
> > 128MB   28.7ms    8.5ms      3.37x
> > 192MB   43.2ms    13.0ms     3.32x
> > 256MB   57.0ms    17.3ms     3.30x
> > 512MB   114.4ms   38.5ms     2.97x
>

Hi Kairui,

> One concern here is that the total amount of work is
> unchanged. But when under pressure these workers could
> be a larger burden.

The total CPU work is actually slightly reduced: the
batch drain eliminates the pool->lock acquisition entirely
and holds class->lock across consecutive same-class handles
instead of acquiring and releasing it per handle. So the
deferred path does less locking than synchronous per-handle
zs_free(). I'm also exploring further reductions, such as
merging the zram flag operations in the notify path (as you
suggested earlier) and further reducing lock overhead.
Suggestions are welcome.
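
To illustrate the lock-count argument, here is a minimal
userspace sketch. It is not the patch code: struct
deferred_handle and both function names are hypothetical,
and it only counts how often class->lock would be taken
under each scheme, assuming the drain walks the handles in
buffer order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffered handle; only its size class matters. */
struct deferred_handle {
	unsigned int class_idx;
};

/* Synchronous model: each free takes and drops the class lock once. */
size_t lock_acquisitions_per_handle(const struct deferred_handle *h, size_t n)
{
	assert(h != NULL || n == 0);
	return n;
}

/*
 * Batched model: the class lock is taken once per run of consecutive
 * handles that share a class_idx, and dropped when the class changes.
 */
size_t lock_acquisitions_batched(const struct deferred_handle *h, size_t n)
{
	size_t acquisitions = 0;

	assert(h != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (i == 0 || h[i].class_idx != h[i - 1].class_idx)
			acquisitions++;
	}
	return acquisitions;
}
```

In this model the saving depends on how long the same-class
runs in the buffer are. If classes alternate on every
handle, the batched count degenerates to the per-handle
count.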

The key win is not reducing work but unblocking anon
folio freeing. Each folio free returns a full page
immediately, whereas zs_free may need many handle frees
before a zspage becomes empty (multiple compressed
objects share the same zspage). By not blocking folio
freeing with expensive zs_free, we improve the rate at
which usable memory returns to the system.
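
As a rough model of that asymmetry (a sketch under stated
assumptions, not zsmalloc internals): struct zspage_model
and both helpers are hypothetical, the folio is assumed to
be a single page, and a zspage is assumed to return its
pages only once its last object is freed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zspage model: a few pages backing several compressed objects. */
struct zspage_model {
	size_t nr_pages;	/* pages backing this zspage */
	size_t inuse;		/* live objects still stored in it */
};

/* Free one object; returns how many pages this call hands back. */
size_t model_free_object(struct zspage_model *z)
{
	assert(z->inuse > 0);
	z->inuse--;
	/* Pages come back only when the zspage is completely empty. */
	return z->inuse == 0 ? z->nr_pages : 0;
}

/* Freeing a single-page folio (assumption) always returns one page. */
size_t model_free_folio(void)
{
	return 1;
}
```

With nr_pages = 2 and inuse = 3, the first two object frees
return nothing and only the third returns the two pages,
whereas every folio free returns a page right away.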

With parallelism (munmap + worker on different CPUs),
the process exits faster and memory is returned sooner.
For example, what used to take ~1s on one CPU can now
complete in ~400ms across two CPUs. Under memory
pressure, spending a bit more CPU to release memory
faster is a reasonable tradeoff.

> Is it possible for you to measure that part too?

Sure. Could you describe the specific scenario you're
concerned about — CPU contention, memory pressure, or
scheduling latency? I'm happy to design and run a test
around it.

Thanks,
Wenchao