Re: [PATCH v2] mm: add zblock allocator

From: Vitaly Wool
Date: Tue Apr 08 2025 - 19:12:54 EST



> > > So zstd results in nearly double the compression ratio, which in turn
> > > cuts total execution time *almost in half*.
> > >
> > > The numbers speak for themselves. Compression efficiency >>> allocator
> > > speed, because compression efficiency ultimately drives the continuous
> > > *rate* at which allocations need to occur. You're trying to optimize a
> > > constant coefficient at the expense of a higher-order one, which is a
> > > losing proposition.

> > Well, not really. This is an isolated use case with
> > a. significant computing power under the hood
> > b. relatively few cores
> > c. relatively short test
> > d. 4K pages
> >
> > If any of these isn't true, zblock dominates.
> > !a => zstd is too slow
> > !b => parallelization has a greater effect
> > !c => zsmalloc starts losing due to having to deal with internal
> > fragmentation
> > !d => compression efficiency of zblock is better.
> >
> > Even !d alone makes zblock a better choice for ARM64-based servers.
> >
> > ~Vitaly

> Could you expand on each point? And do you have data to show this?
>
> For b, we run zswap + zsmalloc on hosts with hundreds of cores, and
> have not found zsmalloc to be a noticeable bottleneck yet, FWIW.

I don't have the numbers at hand; I think Igor will be able to provide those tomorrow.

> For c - in longer runs, how does zblock perform better than zsmalloc?
> In fact, my understanding is that zsmalloc does compaction, which
> should help with internal fragmentation over time. zblock doesn't seem
> to do this, or maybe I missed it?

The thing is, zblock doesn't have to. Imagine a street with cars parked at the side. If cars of different lengths drive in and out, you'll end up with gaps in between that longer cars won't be able to squeeze into. This is why zsmalloc does compaction.

Now for zblock you can say that only cars of the same length are allowed to park on a given street, so that street is either full or has a free spot.
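To make the analogy concrete, here is a toy userspace sketch of the idea (this is not the zblock code; names, the one-page block size and the 512-byte slot size are made up for illustration). Because every slot in a block has the same size, freeing a slot always leaves a hole that the next allocation of that class can reuse directly, so there is nothing to compact:

/* Toy model: one fixed slot size per block, free slots tracked in a bitmap. */
#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

#define TOY_BLOCK_SIZE 4096            /* one page per block in this toy model */

struct toy_block {
	size_t slot_size;              /* every slot in this block has this size */
	unsigned int nr_slots;
	uint64_t free_map;             /* 1 bit per slot, up to 64 slots */
	unsigned char data[TOY_BLOCK_SIZE];
};

static struct toy_block *toy_block_create(size_t slot_size)
{
	struct toy_block *b = calloc(1, sizeof(*b));

	if (!b)
		return NULL;
	b->slot_size = slot_size;
	b->nr_slots = TOY_BLOCK_SIZE / slot_size;
	if (b->nr_slots > 64)
		b->nr_slots = 64;
	b->free_map = (b->nr_slots == 64) ? ~0ULL : (1ULL << b->nr_slots) - 1;
	return b;
}

/* Any free slot fits, because all slots in the block are equal-sized. */
static void *toy_block_alloc(struct toy_block *b)
{
	unsigned int slot;

	if (!b->free_map)
		return NULL;                   /* block is full */
	slot = __builtin_ctzll(b->free_map);
	b->free_map &= ~(1ULL << slot);
	return b->data + (size_t)slot * b->slot_size;
}

/* Freeing never creates an unusable gap: the hole is exactly one slot. */
static void toy_block_free(struct toy_block *b, void *p)
{
	unsigned int slot = ((unsigned char *)p - b->data) / b->slot_size;

	b->free_map |= 1ULL << slot;
}

int main(void)
{
	struct toy_block *b = toy_block_create(512);
	void *a = toy_block_alloc(b);
	void *c = toy_block_alloc(b);
	void *d;

	toy_block_free(b, a);                  /* leave a hole ... */
	d = toy_block_alloc(b);                /* ... which is reused immediately */
	printf("hole reused: %s\n", d == a ? "yes" : "no");
	(void)c;
	free(b);
	return 0;
}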

> For d too. I see that you hard code special configurations for zblock
> blocks in the case of 0x4000 page size, but how does that help with
> compression efficiency?

Well, to be able to answer that I need to dig more into zsmalloc's operation, but I would guess that zsmalloc's chunks are simply multiplied by 4 in the case of a 16K page, so you lose all the granularity you used to have. I'm not completely certain, though.
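If that guess is right, the effect is easy to quantify with a back-of-the-envelope sketch. The 64- and 256-byte class spacings and the object sizes below are assumptions picked purely for illustration, not zsmalloc's actual size classes:

#include <stdio.h>

/* Bytes wasted when rounding an object up to the next size class. */
static unsigned int waste(unsigned int obj, unsigned int spacing)
{
	unsigned int rounded = ((obj + spacing - 1) / spacing) * spacing;

	return rounded - obj;
}

int main(void)
{
	unsigned int sizes[] = { 700, 1100, 1500, 2300 }; /* made-up compressed sizes */
	unsigned int i;

	for (i = 0; i < 4; i++)
		printf("obj %4u: waste %3u (64B classes) vs %3u (256B classes)\n",
		       sizes[i], waste(sizes[i], 64), waste(sizes[i], 256));
	return 0;
}

If the class spacing really does grow with the page size, the per-object overhead grows with it.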

Meanwhile, I did a quick measurement run with zblock and zsmalloc on a Raspberry Pi 5 (native kernel build test), with zstd as the compression backend. The results are as follows:

1. zsmalloc
*** The build was OOM killed ***
real 26m58.876s
user 95m32.425s
sys 4m39.017s
Zswap: 250944 kB
Zswapped: 871536 kB
zswpin 108
zswpout 54473
663296 /mnt/tmp/build/

2. zblock
real 27m31.579s
user 96m42.845s
sys 4m40.464s
Zswap: 66592 kB
Zswapped: 563168 kB
zswpin 243
zswpout 35262
1423200 /mnt/tmp/build/

You can see from the size of the build folder that the first run was terminated prematurely, nowhere near the end of the build.

So, I can re-run the tests on an 8-core high-performance ARM64 machine with 16K pages tomorrow, but so far everything we have seen points in one direction: zblock is clearly superior to zsmalloc in a 16K page configuration.

Besides, zblock can do even better if we extend the very hardcoded table you mentioned (and BTW, it could be generated automatically at init, but I don't see the point in that).
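For what it's worth, here is a rough userspace sketch of what generating such a table at init could look like. This is not zblock code: the one-page block, the 63-entry limit, the 8-byte rounding and the "one entry per slot count" policy are all assumptions made for illustration:

#include <stdio.h>

#define TOY_PAGE_SIZE	16384
#define TOY_MAX_SLOTS	63

struct toy_block_desc {
	unsigned int slot_size;        /* rounded down to an 8-byte boundary */
	unsigned int slots_per_block;
};

static struct toy_block_desc toy_table[TOY_MAX_SLOTS];

/* Fill one descriptor per slot count: n equal slots per block, densest fit. */
static void toy_gen_table(void)
{
	unsigned int n;

	for (n = 1; n <= TOY_MAX_SLOTS; n++) {
		toy_table[n - 1].slots_per_block = n;
		toy_table[n - 1].slot_size = (TOY_PAGE_SIZE / n) & ~7u;
	}
}

int main(void)
{
	unsigned int i;

	toy_gen_table();
	for (i = 0; i < 5; i++)
		printf("%2u slots -> slot size %5u\n",
		       toy_table[i].slots_per_block, toy_table[i].slot_size);
	return 0;
}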

~Vitaly