Re: [PATCH v2] mm: add zblock allocator
From: Vitaly Wool
Date: Tue Apr 08 2025 - 19:12:54 EST
>>> So zstd results in nearly double the compression ratio, which in turn
>>> cuts total execution time *almost in half*.
>>> The numbers speak for themselves. Compression efficiency >>> allocator
>>> speed, because compression efficiency ultimately drives the continuous
>>> *rate* at which allocations need to occur. You're trying to optimize a
>>> constant coefficient at the expense of a higher-order one, which is a
>>> losing proposition.
>> Well, not really. This is an isolated use case with
>> a. significant computing power under the hood
>> b. relatively few cores
>> c. a relatively short test
>> d. 4K pages
>> If any of these doesn't hold, zblock dominates:
>> !a => zstd is too slow
>> !b => parallelization has a bigger effect
>> !c => zsmalloc starts losing because it has to deal with internal
>> fragmentation
>> !d => compression efficiency of zblock is better.
>> Even !d alone makes zblock a better choice for ARM64-based servers.
>>
>> ~Vitaly
> Could you expand on each point? And do you have data to show this?
> For b, we run zswap + zsmalloc on hosts with hundreds of cores, and
> have not found zsmalloc to be a noticeable bottleneck yet, FWIW.
I don't have the numbers at hand; I think Igor will be able to provide
those tomorrow.
> For c - in longer runs, how does zblock perform better than zsmalloc?
> In fact, my understanding is that zsmalloc does compaction, which
> should help with internal fragmentation over time. zblock doesn't seem
> to do this, or maybe I missed it?
The thing is, zblock doesn't have to. Imagine a street with cars parked
along the side. If cars of different lengths keep driving in and out,
you end up with gaps in between that longer cars can't squeeze into.
That is why zsmalloc does compaction.
With zblock, only cars of the same length are allowed to park on a given
street, so that street is either full or it has a spot that fits.
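
In allocator terms the "street" is a block of equally-sized slots, so a
freed slot always fits the next object of that size class and compaction
is never needed. Very roughly (just an illustrative sketch with made-up
names, not the actual zblock code):

	#include <stddef.h>

	/* Illustrative sketch only. Each block holds slots of one fixed
	 * size, so any freed slot fits the next object of the same size
	 * class and no compaction is ever needed. */
	struct toy_block {
		size_t slot_size;	/* every slot in this block has this size */
		unsigned long free_map;	/* one bit per slot, up to 64 slots */
		unsigned int nr_slots;
		void *data;		/* nr_slots * slot_size bytes of storage */
	};

	static void *toy_alloc(struct toy_block *b)
	{
		for (unsigned int i = 0; i < b->nr_slots; i++) {
			if (!(b->free_map & (1UL << i))) {
				b->free_map |= 1UL << i;
				return (char *)b->data + i * b->slot_size;
			}
		}
		return NULL;	/* block is full: take another block */
	}

	static void toy_free(struct toy_block *b, void *p)
	{
		unsigned int i = ((char *)p - (char *)b->data) / b->slot_size;

		b->free_map &= ~(1UL << i);	/* slot is immediately reusable */
	}

	int main(void)
	{
		char storage[8 * 64];
		struct toy_block b = { .slot_size = 64, .free_map = 0,
				       .nr_slots = 8, .data = storage };
		void *p = toy_alloc(&b);	/* grabs slot 0 */

		toy_free(&b, p);	/* slot 0 is reusable right away */
		return 0;
	}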
> For d too. I see that you hard-code special configurations for zblock
> blocks in the case of 0x4000 page size, but how does that help with
> compression efficiency?
Well, to answer that properly I would need to dig deeper into how
zsmalloc operates, but my guess is that zsmalloc's chunk sizes are
simply multiplied by 4 with a 16K page size, so you lose all the
granularity you used to have. I'm not completely certain, though.
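
Back-of-the-envelope illustration of that guess (the proportional
spacing below is my assumption, not something I have verified against
the zsmalloc source):

	#include <stdio.h>

	/* Assumption (mine, unverified): size classes are spaced roughly
	 * PAGE_SIZE / number_of_classes apart, i.e. proportionally to the
	 * page size. */
	int main(void)
	{
		unsigned int num_classes = 256;		/* assumed */
		unsigned int page_sizes[] = { 4096, 16384 };

		for (int i = 0; i < 2; i++) {
			unsigned int delta = page_sizes[i] / num_classes;

			printf("PAGE_SIZE %5u -> class spacing ~%u bytes\n",
			       page_sizes[i], delta);
		}
		return 0;
	}

If the spacing really grows like that (~16 bytes on 4K pages vs ~64
bytes on 16K pages), every stored object gets rounded up to a four times
coarser boundary, which is exactly the granularity loss I mean.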
Meanwhile I did a quick measurement run with zblock and zsmalloc on a
Raspberry Pi 5 (native kernel build test), with zstd as the compression
backend, and the results are as follows:
1. zsmalloc
*** The build was OOM killed ***
real 26m58.876s
user 95m32.425s
sys 4m39.017s
Zswap: 250944 kB
Zswapped: 871536 kB
zswpin 108
zswpout 54473
663296 /mnt/tmp/build/
2. zblock
real 27m31.579s
user 96m42.845s
sys 4m40.464s
Zswap: 66592 kB
Zswapped: 563168 kB
zswpin 243
zswpout 35262
1423200 /mnt/tmp/build/
You can see from the size of the build folder that the first run was
terminated prematurely, nowhere near the end of the build.
So, I can re-run the tests on an 8-core high-performance ARM64 machine
with 16K pages tomorrow, but so far everything we have seen points in
one direction: zblock is clearly superior to zsmalloc in a 16K page
configuration.
Besides, zblock can do even better if we extend the very hardcoded table
you mentioned (and BTW, it could be generated automatically at init, but
I don't see much point in that).
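
For the record, such generation would amount to little more than
picking, for each slot size, the smallest block (in pages) whose tail
waste stays acceptably small; a rough sketch (the page size, the 64-byte
step and the 2% threshold are example values of mine, not zblock's
actual parameters):

	#include <stdio.h>

	#define TOY_PAGE_SIZE	16384	/* assumed 16K page */
	#define TOY_MAX_PAGES	4	/* assumed max pages per block */

	/* For each candidate slot size, pick the smallest block whose tail
	 * waste stays under ~2% of the block; sizes that never fit the
	 * budget are simply skipped. */
	int main(void)
	{
		for (unsigned int slot = 64; slot <= TOY_PAGE_SIZE; slot += 64) {
			for (unsigned int pages = 1; pages <= TOY_MAX_PAGES; pages++) {
				unsigned int block = pages * TOY_PAGE_SIZE;
				unsigned int nslots = block / slot;
				unsigned int waste = block - nslots * slot;

				if (waste * 50 < block) {	/* waste < 2% */
					printf("slot %5u: %u page(s), %3u slots\n",
					       slot, pages, nslots);
					break;
				}
			}
		}
		return 0;
	}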
~Vitaly