Re: [PATCH v4] mm: add zblock allocator

From: Vitaly Wool
Date: Thu May 01 2025 - 08:42:53 EST

Next message: Marc Zyngier: "Re: [PATCH v4 2/5] arm64: KVM: use mutex_trylock_nest_lock when locking all vCPUs"
Previous message: Xavier: "Re: [mm/contpte v3 1/1] mm/contpte: Optimize loop to reduce redundant operations"
Next in thread: Sergey Senozhatsky: "Re: [PATCH v4] mm: add zblock allocator"

Hi Yosry,

On 4/30/25 14:27, Yosry Ahmed wrote:

On Wed, Apr 23, 2025 at 09:53:48PM +0200, Vitaly Wool wrote:

On 4/22/25 12:46, Yosry Ahmed wrote:

I didn't look too closely but I generally agree that we should improve
zsmalloc where possible rather than add a new allocator. We are trying
not to repeat the zbud/z3fold or slub/slob stories here. Zsmalloc is
getting a lot of mileage from both zswap and zram, and is more-or-less
battle-tested. Let's work toward building upon that instead of starting
over.

The thing here is, zblock is using a very different approach to small object
allocation. The idea is: we have an array of descriptors which correspond to
multi-page blocks divided in chunks of equal size (block_size[i]). For each
object of size x we find the descriptor n such that:
block_size[n-1] < x <= block_size[n]
and then we store that object in an empty slot in one of the blocks. Thus,
the density is high, the search is fast (rbtree based) and there are no
objects spanning over 2 pages, so no extra memcpy involved.
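
As a rough illustration of the size-to-descriptor lookup described above (not zblock's actual code): for an object of size x, pick the smallest n with x <= block_size[n]. The table values and the name find_block_type() are made up for this sketch, and a binary search over a sorted array stands in for the rbtree-based search mentioned in the text.

```c
#include <stddef.h>

/* Hypothetical slot sizes in bytes, sorted ascending; not zblock's real table. */
static const size_t block_size[] = { 64, 128, 256, 512, 1024, 2048, 4096 };
#define NUM_BLOCK_TYPES (sizeof(block_size) / sizeof(block_size[0]))

/*
 * Return the smallest index n with x <= block_size[n], which for n > 0
 * also means block_size[n-1] < x. Return -1 if x is 0 or larger than
 * the biggest slot size.
 */
static int find_block_type(size_t x)
{
	size_t lo = 0, hi = NUM_BLOCK_TYPES;

	if (x == 0 || x > block_size[NUM_BLOCK_TYPES - 1])
		return -1;

	/* Lower-bound binary search over the sorted size table. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (block_size[mid] < x)
			lo = mid + 1;
		else
			hi = mid;
	}
	return (int)lo;
}
```

With this table, a 65-byte object lands in the 128-byte descriptor, so at most one slot's worth of slack is wasted per object and no object crosses a slot boundary.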

The block sizes seem to be similar in principle to class sizes in
zsmalloc. It seems to me that there are two apparent differentiating
properties to zblock:

- Block lookup uses an rbtree, so it's faster than zsmalloc's list
iteration. On the other hand, zsmalloc divides each class into
fullness groups and tries to pack almost full groups first. Not sure
if zblock's approach is strictly better.

If we free a slot in a fully packed block, we put that block on top of the list. With zswap's normal operation pattern, more slots in that block are likely to be freed soon, so the result is roughly the same.
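
A minimal sketch of that list behaviour, under the assumption that fully packed blocks are kept off the per-size list and are linked back in at the head when their first slot is freed. The struct layout and the names push_front() and free_slot() are hypothetical and only illustrate the idea.

```c
#include <stddef.h>

/* Hypothetical block bookkeeping; field names are illustrative only. */
struct block {
	struct block *prev, *next;
	unsigned int free_slots;
};

struct block_list {
	struct block *head;	/* blocks that have at least one free slot */
};

/* Link b in as the new head of the list. */
static void push_front(struct block_list *l, struct block *b)
{
	b->prev = NULL;
	b->next = l->head;
	if (l->head)
		l->head->prev = b;
	l->head = b;
}

/*
 * Account for one freed slot. Assumption: a full block (free_slots == 0)
 * is not on the list, so its first freed slot puts it on top.
 */
static void free_slot(struct block_list *l, struct block *b)
{
	if (b->free_slots++ == 0)
		push_front(l, b);
}
```

The next allocation of that size would then look at the head of the list first, so a block that has just started draining is the first candidate for refilling.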

- Zblock uses higher order allocations vs. zsmalloc always using order-0
allocations. I think this may be the main advantage and I remember
asking if zsmalloc can support this. Always using order-0 pages is
more reliable but may not always be the best choice.

There's a patch we'll be posting soon with "opportunistic" high order allocations (i.e., if try_alloc_pages fails, we allocate order-0 pages instead). This will leverage the benefits of higher order allocations without putting too much stress on the system.
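
The fallback idea can be sketched in userspace C roughly as follows. This is not the upcoming patch: alloc_block(), fake_alloc_pages() and max_ok_order are hypothetical names, malloc() stands in for the page allocator, and the fallback here is a single order-0 request because the message does not say how a block is assembled from order-0 pages.

```c
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096UL

/* Allocation backend: returns 2^order contiguous "pages" or NULL. */
typedef void *(*page_alloc_fn)(unsigned int order);

/*
 * Try the high order first; if that fails, fall back to order 0.
 * *got_order reports which order the caller attempted last.
 */
static void *alloc_block(page_alloc_fn try_alloc, unsigned int order,
			 unsigned int *got_order)
{
	void *p = order ? try_alloc(order) : NULL;

	if (p) {
		*got_order = order;
		return p;
	}
	*got_order = 0;
	return try_alloc(0);
}

/* Stand-in backend: pretends orders above max_ok_order cannot be satisfied. */
static unsigned int max_ok_order;

static void *fake_alloc_pages(unsigned int order)
{
	if (order > max_ok_order)
		return NULL;
	return malloc(SKETCH_PAGE_SIZE << order);
}
```

Setting max_ok_order to 0 mimics a fragmented system: a request for order 2 then comes back as an order-0 allocation instead of failing outright.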

On the other hand, zblock is lacking in other regards. For example:
- The lack of compaction means that certain workloads will see a lot of
fragmentation. It purely depends on the access patterns. We could end
up with a lot of blocks each containing a single object and there is
no way to recover AFAICT.

We have been putting the memory subsystem under many variants of stress load, and the worst compression ratio *after* the stress load was 2.8x using zstd as the compressor (and about 4x under load). With zsmalloc under the same conditions the ratio was 3.6x after and 4x under load.

With more normal (but still stressing) usage patterns the numbers *after* the stress load were around 3.8x and 4.1x, respectively.

Bottom line, ending up with a lot of blocks each containing a single object is not a real life scenario. With that said, we have a quite simple solution in the making that will get zblock on par with zsmalloc even in the cases described above.

- Zblock will fail if a high order allocation cannot be satisfied, which
is more likely to happen under memory pressure, and it's usually when
zblock is needed in the first place.

See above, this issue will be addressed in the patch coming in a really short while.

- There's probably more, I didn't check too closely, and I am hoping
that Minchan and Sergey will chime in here.

And with the latest zblock, we see that it has a clear advantage in
performance over zsmalloc, retaining roughly the same allocation density for
4K pages and scoring better on 16K pages. E.g., on a kernel compilation:

* zsmalloc/zstd/make -j32 bzImage
real 8m0.594s
user 39m37.783s
sys 8m24.262s
Zswap: 200600 kB <-- after build completion
Zswapped: 854072 kB <-- after build completion
zswpin 309774
zswpout 1538332

* zblock/zstd/make -j32 bzImage
real 7m35.546s
user 38m03.475s
sys 7m47.407s
Zswap: 250940 kB <-- after build completion
Zswapped: 870660 kB <-- after build completion
zswpin 248606
zswpout 1277319

So what we see here is that zblock is definitely faster and at least not
worse with regard to allocation density under heavy load. It has slightly
worse _idle_ allocation density but since it will quickly catch up under
load it is not really important. What is important is that its
characteristics don't deteriorate over time. Overall, zblock is simple and
efficient and there is a /raison d'être/ for it.
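
For reference, the figures quoted above can be turned into ratios with a few lines of C. The helper names are made up for this sketch; the inputs are the Zswap/Zswapped values and the "real" times from the two runs, and the density figure is simply Zswapped divided by Zswap at build completion.

```c
/* Stored-data ratio: uncompressed size (Zswapped) over pool size (Zswap). */
static double density_ratio(double zswapped_kb, double zswap_kb)
{
	return zswapped_kb / zswap_kb;
}

/* Convert a "XmY.ZZZs" wall-clock reading into seconds. */
static double to_seconds(unsigned int minutes, double seconds)
{
	return minutes * 60.0 + seconds;
}

/* Relative wall-clock saving of run b compared to run a, in percent. */
static double speedup_percent(double a_seconds, double b_seconds)
{
	return (a_seconds - b_seconds) / a_seconds * 100.0;
}
```

Plugging in the numbers gives roughly 854072 / 200600 = 4.26 for zsmalloc and 870660 / 250940 = 3.47 for zblock after the build completes, and about 5.2% less wall-clock time for zblock (455.546 s versus 480.594 s).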

Zblock is performing better for this specific workload, but as I
mentioned earlier there are other aspects that zblock is missing.
Zsmalloc has seen a very large range of workloads of different types,
and we cannot just dismiss this.

We've been running many different workloads with both allocators, but posting all the results in the patch description would go well beyond the purpose of a patch submission. If there are some workloads you are particularly interested in, please let me know; odds are high we have some results for those too.

Now, it is indeed possible to partially rework zsmalloc using zblock's
algorithm but this will be a rather substantial change, equal to or greater
in effort than implementing the approach described above from scratch (and
this
is what we did), and with such drastic changes most of the testing that has
been done with zsmalloc would be invalidated, and we'll be out in the wild
anyway. So even though I see your point, I don't think it applies in this
particular case.

Well, we should start by breaking down the differences and finding out
why zblock is performing better, as I mentioned above. If it's the
faster lookups or higher order allocations, we can work to support that
in zsmalloc. Similarly, if zsmalloc has unnecessary complexity it'd be
great to get rid of it rather than starting over.

Also, we don't have to do it all at once and invalidate the testing that
zsmalloc has seen. These can be incremental changes that get spread over
multiple releases, getting incremental exposure in the process.

I believe we are now a lot closer to having a zblock without the initial drawbacks you have pointed out, one that retains its code simplicity, than to having a faster zsmalloc.

~Vitaly
