Re: [RFC PATCH v2 0/4] mm/zsmalloc: reduce zs_free() latency on swap release path

From: Yosry Ahmed

Date: Wed Apr 29 2026 - 18:44:37 EST


> > How much of the benefit do we get with just these locking improvements
> > without having to defer any of the freeing work?
> >
>
> Hi Yosry,
>
> Thanks for the review. Great question — we tested exactly this.
>
> With only the class_idx-in-obj encoding (eliminating pool->lock from
> zs_free, no deferred freeing), we measured on two platforms.
>
> Test: each process independently mmaps 256MB, writes data, and calls
> madvise(MADV_PAGEOUT) to swap it out via zram (lzo-rle); then all
> processes munmap concurrently.
>
> Raspberry Pi 4B (4-core ARM64 Cortex-A72):
>
> mode       Base      ClassIdx-only   Speedup
> single     59.0ms    56.0ms          1.05x
> multi 2p   94.6ms    66.7ms          1.42x
> multi 4p   202.9ms   110.6ms         1.83x
>
> x86 physical machine (4-core Intel i7-12700, 2 rounds averaged):
>
> mode       Base      ClassIdx-only   Speedup
> single     11.7ms    9.8ms           1.19x
> multi 2p   24.1ms    17.2ms          1.40x
> multi 4p   63.0ms    45.3ms          1.39x
>
> Single-process shows modest improvement. With multiple processes,
> each read_lock/read_unlock atomically modifies the shared rwlock
> reader count, and the cost of these atomic operations increases
> with more CPUs accessing the same cacheline concurrently.
> Eliminating pool->lock removes this overhead entirely.
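
Just to check my reading of the contention argument, here is a toy model
of the reader side. It assumes a plain shared counter, ignores writers
entirely, and is not how the kernel's rwlock is implemented. All toy_*
names are made up for illustration.

```c
#include <stdatomic.h>

/* Toy reader-side model: one shared counter, no writer handling. */
struct toy_rwlock {
	atomic_int readers;
};

/* Each reader does an atomic read-modify-write on the shared word. */
static inline void toy_read_lock(struct toy_rwlock *l)
{
	atomic_fetch_add_explicit(&l->readers, 1, memory_order_acquire);
}

/* Dropping the lock is a second atomic RMW on the same word. */
static inline void toy_read_unlock(struct toy_rwlock *l)
{
	atomic_fetch_sub_explicit(&l->readers, 1, memory_order_release);
}

/* Snapshot of the current reader count, for inspection only. */
static inline int toy_readers(struct toy_rwlock *l)
{
	return atomic_load_explicit(&l->readers, memory_order_relaxed);
}
```

Even though readers never block each other here, every lock/unlock pair
writes to the same word, so the cacheline moves between CPUs on each
zs_free() call.
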
>
> This only works on 64-bit systems where OBJ_INDEX_BITS has enough
> spare bits to fit class_idx. 32-bit systems don't have the room.
> I'm still working on the compile-time gating to properly enable
> this based on architecture and page size configuration.
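
For my own understanding, a rough sketch of what packing class_idx into
the obj value could look like. The field widths and all sk_* names are
assumptions for illustration, not the layout used by zsmalloc or by this
series.

```c
#include <stdint.h>

/* Hypothetical field widths; real ones depend on arch and page size. */
#define SK_OBJ_IDX_BITS    12u
#define SK_CLASS_IDX_BITS  8u
#define SK_OBJ_IDX_MASK    ((UINT64_C(1) << SK_OBJ_IDX_BITS) - 1)
#define SK_CLASS_IDX_MASK  ((UINT64_C(1) << SK_CLASS_IDX_BITS) - 1)

/* Assumed layout, high to low bits: pfn | class_idx | obj_idx */
static inline uint64_t sk_encode(uint64_t pfn, uint64_t class_idx,
				 uint64_t obj_idx)
{
	return (pfn << (SK_CLASS_IDX_BITS + SK_OBJ_IDX_BITS)) |
	       ((class_idx & SK_CLASS_IDX_MASK) << SK_OBJ_IDX_BITS) |
	       (obj_idx & SK_OBJ_IDX_MASK);
}

/* The size class is recovered from the value alone, no lookup needed. */
static inline uint64_t sk_class_idx(uint64_t obj)
{
	return (obj >> SK_OBJ_IDX_BITS) & SK_CLASS_IDX_MASK;
}

static inline uint64_t sk_obj_idx(uint64_t obj)
{
	return obj & SK_OBJ_IDX_MASK;
}

static inline uint64_t sk_pfn(uint64_t obj)
{
	return obj >> (SK_CLASS_IDX_BITS + SK_OBJ_IDX_BITS);
}
```

With these made-up widths the pfn is left with 44 bits of a 64-bit word.
In a 32-bit word the same two index fields would leave only 12 bits for
the pfn, which I assume is the limitation you mean.
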
>
> > As others have pointed out, I don't want to just defer expensive work
> > without understanding why it's expensive and running into limitations
> > about why it cannot be improved without deferring.
>
> For the deferred freeing part: the class_idx-in-obj optimization
> addresses the multi-process scenario where concurrent atomic
> operations on pool->lock become expensive, but does not help
> single-process munmap. Deferred freeing moves the entire zs_free
> cost (including class->lock and zspage freeing) off the munmap
> hot path, which benefits even single-process workloads. The two
> optimizations are complementary.

What extra speedup does deferred freeing add on top of the locking
improvements? I couldn't immediately tell by comparing these numbers
with the cover letter.
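
Concretely, something like the following split is what I am after. The
struct and function names are made up, and the combined (locking plus
deferred freeing) latency has to come from your measurements; I am not
assuming any value for it.

```c
/* Hypothetical helper: split a total latency saving into two parts. */
struct gain_split {
	double lock_share;    /* fraction of the saving from the locking change */
	double defer_share;   /* fraction of the saving from deferred freeing */
	double extra_speedup; /* locking-only latency / combined latency */
};

static struct gain_split split_gain(double base_ms, double lock_only_ms,
				    double both_ms)
{
	struct gain_split s = { 0.0, 0.0, 0.0 };
	double total = base_ms - both_ms; /* total saving vs. baseline */

	if (both_ms > 0.0)
		s.extra_speedup = lock_only_ms / both_ms;
	if (total > 0.0) {
		s.lock_share = (base_ms - lock_only_ms) / total;
		s.defer_share = (lock_only_ms - both_ms) / total;
	}
	return s;
}
```

For the Pi 4p row that would be split_gain(202.9, 110.6, x), with x
being the 4p latency measured with both changes applied.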