Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Wenchao Hao

Date: Tue Apr 28 2026 - 10:18:03 EST

On Tue, Apr 28, 2026 at 2:17 AM Yosry Ahmed <yosry@xxxxxxxxxx> wrote:
>
> On Sat, Apr 25, 2026 at 9:13 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> >
> > On Tue, Apr 21, 2026 at 8:16 PM Wenchao Hao <haowenchao22@xxxxxxxxx> wrote:
> > >
> > > Swap freeing can be expensive when unmapping a VMA containing
> > > many swap entries. This has been reported to significantly
> > > delay memory reclamation during Android's low-memory killing,
> > > especially when multiple processes are terminated to free
> > > memory, with slot_free() accounting for more than 80% of
> > > the total cost of freeing swap entries.
> > >
> > > Two earlier attempts by Lei and Zhiguo added a new thread in the mm core
> > > to asynchronously collect and free swap entries [1][2], but the
> > > design itself is fairly complex.
> > >
> > Hi Nhat, Kairui, Barry, Xueyuan,
> >
> > Thanks for the review. I agree with the direction and have some ideas for
> > an alternative approach.
> >
> > My approach: first eliminate pool->lock from zs_free() itself, then defer
> > free to per-cpu buffers with a lockless handoff, and finally reduce
> > class->lock overhead during drain by exploiting natural class locality.
> > Achieving both per-cpu and per-class is difficult, so the class->lock
> > optimization is a compromise — but one that works well in practice.
> >
> > 1. Encode class_idx in obj to eliminate pool->lock
> >
> > OBJ_INDEX_BITS is over-provisioned on 64-bit. For example on arm64
> > (chain_size=8): OBJ_INDEX_BITS=24 but only 10 bits are actually needed
> > for obj_idx, leaving 14 spare bits.
> > We can split OBJ_INDEX into class_idx + obj_idx:
> >
> > obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)]
> >
> > OBJ_CLASS_BITS is computed dynamically as `ilog2(ZS_SIZE_CLASSES - 1) + 1`
> > (8 bits for 4K pages, 9 for 64K).
> > Since class_idx is invariant across migration (only PFN changes), zs_free()
> > can extract class_idx locklessly, then acquire class->lock and re-read obj for a
> > stable PFN. No pool->lock needed.
>
> How much of the benefit do we get with just these locking improvements
> without having to defer any of the freeing work?
>

Hi Yosry,

Thanks for the review. Great question — we tested exactly this.

With only the class_idx-in-obj encoding (eliminating pool->lock from
zs_free, no deferred freeing), we measured on two platforms.

Test: each process independently mmaps 256MB, writes data, and calls
madvise(MADV_PAGEOUT) to swap it out via zram (lzo-rle); all processes
then munmap concurrently.

Raspberry Pi 4B (4-core ARM64 Cortex-A72):

mode       Base      ClassIdx-only   Speedup
single     59.0ms    56.0ms          1.05x
multi 2p   94.6ms    66.7ms          1.42x
multi 4p   202.9ms   110.6ms         1.83x

x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):

mode       Base      ClassIdx-only   Speedup
single     11.7ms    9.8ms           1.19x
multi 2p   24.1ms    17.2ms          1.40x
multi 4p   63.0ms    45.3ms          1.39x

The single-process case shows only a modest improvement. With
multiple processes, each read_lock()/read_unlock() pair atomically
modifies the shared rwlock reader count, and these atomic operations
get more expensive as more CPUs contend for the same cacheline.
Eliminating pool->lock from zs_free() removes this overhead entirely.
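
To make the reader-count point concrete, here is a minimal userspace
model of the read side of a reader-writer lock. It is only a sketch: it
is not the kernel's rwlock implementation, the names (model_rwlock,
model_free_objects, ...) are made up for illustration, and it shows just
the one property that matters here, namely that even a read lock
performs an atomic write to memory shared by all CPUs.

```c
#include <assert.h>
#include <stdatomic.h>

/* Simplified model only: a single shared reader counter, no writer side. */
struct model_rwlock {
	atomic_int readers;	/* one cacheline shared by every reader */
};

static void model_rwlock_init(struct model_rwlock *l)
{
	atomic_init(&l->readers, 0);
}

static void model_read_lock(struct model_rwlock *l)
{
	/* Atomic read-modify-write: the CPU must own the cacheline. */
	atomic_fetch_add_explicit(&l->readers, 1, memory_order_acquire);
}

static void model_read_unlock(struct model_rwlock *l)
{
	int prev = atomic_fetch_sub_explicit(&l->readers, 1,
					     memory_order_release);

	assert(prev > 0);	/* unlock without a matching lock */
	(void)prev;
}

/*
 * Model n frees that each take and drop the read lock. Returns the
 * number of atomic RMWs issued on the shared counter (two per free).
 */
static unsigned long model_free_objects(struct model_rwlock *l,
					unsigned long n)
{
	unsigned long rmws = 0;

	for (unsigned long i = 0; i < n; i++) {
		model_read_lock(l);
		/* ... per-object work would happen here ... */
		model_read_unlock(l);
		rmws += 2;
	}
	return rmws;
}
```

In this model every free issues two atomic RMWs on the same counter no
matter which object it touches, so with several CPUs freeing at once the
cacheline holding the counter bounces between them.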

This only works on 64-bit systems where OBJ_INDEX_BITS has enough
spare bits to fit class_idx. 32-bit systems don't have the room.
I'm still working on the compile-time gating to properly enable
this based on architecture and page size configuration.
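
For reference, the layout and the gating condition can be sketched
roughly as below. This is an illustration and not the patch code: the
widths are hard-coded to the arm64 example (OBJ_INDEX_BITS=24, 10 bits
needed for obj_idx, 8 bits for class_idx), the helper names and
OBJ_IDX_BITS_NEEDED are hypothetical, and any tag bits in the real
encoding are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative widths taken from the arm64 example; real values vary. */
#define OBJ_INDEX_BITS		24	/* non-PFN bits available in obj */
#define OBJ_CLASS_BITS		8	/* class_idx width with 4K pages */
#define OBJ_IDX_BITS		(OBJ_INDEX_BITS - OBJ_CLASS_BITS)
#define OBJ_IDX_BITS_NEEDED	10	/* hypothetical name: bits obj_idx needs */

/* Compile-time gate: only split the field if obj_idx still fits. */
static_assert(OBJ_IDX_BITS >= OBJ_IDX_BITS_NEEDED,
	      "not enough spare bits in OBJ_INDEX for class_idx");

#define OBJ_INDEX_MASK	((UINT64_C(1) << OBJ_INDEX_BITS) - 1)
#define OBJ_CLASS_MASK	((UINT64_C(1) << OBJ_CLASS_BITS) - 1)
#define OBJ_IDX_MASK	((UINT64_C(1) << OBJ_IDX_BITS) - 1)

/* obj: [PFN | class_idx (OBJ_CLASS_BITS) | obj_idx (OBJ_IDX_BITS)] */
static inline uint64_t obj_pack(uint64_t pfn, unsigned int class_idx,
				unsigned int obj_idx)
{
	return (pfn << OBJ_INDEX_BITS) |
	       (((uint64_t)class_idx & OBJ_CLASS_MASK) << OBJ_IDX_BITS) |
	       ((uint64_t)obj_idx & OBJ_IDX_MASK);
}

static inline unsigned int obj_get_class_idx(uint64_t obj)
{
	return (unsigned int)((obj >> OBJ_IDX_BITS) & OBJ_CLASS_MASK);
}

static inline unsigned int obj_get_idx(uint64_t obj)
{
	return (unsigned int)(obj & OBJ_IDX_MASK);
}

static inline uint64_t obj_get_pfn(uint64_t obj)
{
	return obj >> OBJ_INDEX_BITS;
}

/* Migration only rewrites the PFN; class_idx and obj_idx are preserved. */
static inline uint64_t obj_set_pfn(uint64_t obj, uint64_t new_pfn)
{
	return (new_pfn << OBJ_INDEX_BITS) | (obj & OBJ_INDEX_MASK);
}
```

The point of the sketch is that obj_get_class_idx() returns the same
value before and after obj_set_pfn(), which is what lets zs_free() read
class_idx without holding pool->lock.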

> As others have pointed out, I don't want to just defer expensive work
> without understanding why it's expensive and running into limitations
> about why it cannot be improved without deferring.

For the deferred freeing part: the class_idx-in-obj optimization
addresses the multi-process scenario where concurrent atomic
operations on pool->lock become expensive, but does little for
single-process munmap (1.05x-1.19x above). Deferred freeing moves
the entire zs_free cost (including class->lock and zspage freeing)
off the munmap hot path, which benefits even single-process
workloads. The two optimizations are complementary.
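
As a rough illustration of what "off the hot path" means, here is a
minimal single-buffer sketch of deferral. It is not the proposed
implementation: it models neither per-cpu buffers nor the lockless
handoff, the names and the batch size are hypothetical, and draining
inline when the buffer is full is an assumption made only to keep the
example self-contained.

```c
#include <assert.h>
#include <stddef.h>

#define DEFER_BUF_SIZE	64	/* hypothetical batch size */

struct defer_buf {
	unsigned long handles[DEFER_BUF_SIZE];
	size_t count;
};

static size_t nr_freed;	/* counts calls to the stand-in expensive free */

/* Stand-in for the real free (class->lock, zspage freeing, ...). */
static void expensive_free(unsigned long handle)
{
	(void)handle;
	nr_freed++;
}

/* Slow path: run later, outside munmap. Returns the number drained. */
static size_t defer_drain(struct defer_buf *b)
{
	size_t n = b->count;

	for (size_t i = 0; i < n; i++)
		expensive_free(b->handles[i]);
	b->count = 0;
	return n;
}

/* Hot path: only records the handle; drains inline if the buffer is full. */
static void defer_free(struct defer_buf *b, unsigned long handle)
{
	if (b->count == DEFER_BUF_SIZE)
		defer_drain(b);
	assert(b->count < DEFER_BUF_SIZE);
	b->handles[b->count++] = handle;
}
```

In this sketch the caller of defer_free() pays for one array store per
object as long as the buffer has room, and the expensive work happens
whenever defer_drain() is run.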

Thanks,
Wenchao