Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Wenchao Hao

Date: Tue Apr 28 2026 - 10:35:28 EST

On Tue, Apr 28, 2026 at 9:51 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
>
> On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
> >
> > On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > >
> > > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > > >
> > > > Swap freeing can be expensive when unmapping a VMA containing
> > > > many swap entries. This has been reported to significantly
> > > > delay memory reclamation during Android's low-memory killing,
> > > > especially when multiple processes are terminated to free
> > > > memory, with slot_free() accounting for more than 80% of
> > > > the total cost of freeing swap entries.
> > > >
> > > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > > to asynchronously collect and free swap entries [1][2], but the
> > > > design itself is fairly complex.
> > > >
> > > Hi Nhat, Kairui, Barry, Xueyuan,
> > >
> > > Thanks for the review. I agree with the direction and have some ideas for
> > > an alternative approach.
> > >
> > > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > > free to per-cpu buffers with a lockless handoff, and finally reduce
> > > class->lock overhead during drain by exploiting natural class locality.
> > > Achieving both per-cpu and per-class is difficult, so the class->lock
> > > optimization is a compromise — but one that works well in practice.
> > >
> > > 1. Encode class_idx in obj to eliminate pool->lock
> > >
> > > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > > for obj_idx, leaving 14 spare bits.
> > > We can split OBJ_INDEX into class_idx + obj_idx:
> > >
> > > obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> > >
> > > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > > (8 bits for 4K pages, 9 for 64K).
> > > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > > stable PFN. No pool->lock needed.
> >
> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question — we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmap()s 256MB, writes data, calls
> madvise(MADV_PAGEOUT) to swap it out via zram (lzo-rle), then all
> processes munmap() concurrently.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
> mode       Base      ClassIdx-only   Speedup
> single     59.0ms    56.0ms          1.05x
> multi 2p   94.6ms    66.7ms          1.42x
> multi 4p   202.9ms   110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
> mode       Base      ClassIdx-only   Speedup
> single     11.7ms    9.8ms           1.19x
> multi 2p   24.1ms    17.2ms          1.40x
> multi 4p   63.0ms    45.3ms          1.39x
>

Correction on the x86 test description: the machine is an Intel
i7-12700 with 20 logical CPUs (12 cores), not 4 cores. The test
only ran 4 concurrent processes, so the multi 4p result (1.39x) was
measured with 4 of the 20 CPUs active. pool->lock contention would
likely be higher with more concurrent processes on this machine.
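
For readers following the locking argument, here is a minimal userspace
sketch of the obj layout from point 1 of the quoted proposal. It is not
the patch code. The bit widths are assumptions taken from the arm64/4K
example above (8 class bits, 10 obj_idx bits), and the helper names
(pack_obj, obj_class_idx, obj_get_idx, obj_pfn, obj_set_pfn) are
hypothetical. It only illustrates that rewriting the PFN, as migration
does, leaves class_idx untouched, which is why class_idx can be read
without pool->lock.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed widths from the arm64 / 4K-page example in this thread. */
#define OBJ_CLASS_BITS 8
#define OBJ_IDX_BITS   10
#define OBJ_LOW_BITS   (OBJ_CLASS_BITS + OBJ_IDX_BITS)

#define OBJ_IDX_MASK   ((UINT64_C(1) << OBJ_IDX_BITS) - 1)
#define OBJ_CLASS_MASK ((UINT64_C(1) << OBJ_CLASS_BITS) - 1)
#define OBJ_LOW_MASK   ((UINT64_C(1) << OBJ_LOW_BITS) - 1)

/* Layout: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)] */
static uint64_t pack_obj(uint64_t pfn, unsigned int class_idx,
                         unsigned int obj_idx)
{
    assert(class_idx <= OBJ_CLASS_MASK);
    assert(obj_idx <= OBJ_IDX_MASK);
    return (pfn << OBJ_LOW_BITS) |
           ((uint64_t)class_idx << OBJ_IDX_BITS) |
           (uint64_t)obj_idx;
}

/* class_idx lives in bits that migration never rewrites. */
static unsigned int obj_class_idx(uint64_t obj)
{
    return (unsigned int)((obj >> OBJ_IDX_BITS) & OBJ_CLASS_MASK);
}

static unsigned int obj_get_idx(uint64_t obj)
{
    return (unsigned int)(obj & OBJ_IDX_MASK);
}

static uint64_t obj_pfn(uint64_t obj)
{
    return obj >> OBJ_LOW_BITS;
}

/* Models migration: only the PFN field changes, low bits are kept. */
static uint64_t obj_set_pfn(uint64_t obj, uint64_t new_pfn)
{
    return (new_pfn << OBJ_LOW_BITS) | (obj & OBJ_LOW_MASK);
}
```

In the real free path the PFN would still have to be re-read under
class->lock, as described above. The sketch covers only the bit layout,
not the locking.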