Re: [PATCH 0/3] Allow ZRAM to use any zpool-compatible backend

From: Vitaly Wool
Date: Mon Oct 21 2019 - 10:21:38 EST

Next message: David Howells: "Re: [PATCH] security/keyring: avoid pagefaults in keyring_read_iterator"
Previous message: Oleg Nesterov: "cgroup_enable_task_cg_lists && PF_EXITING (Was: KCSAN: data-race in exit_signals / prepare_signal)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Oct 15, 2019 at 10:00 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
>
> On Tue, Oct 15, 2019 at 09:39:35AM +0200, Vitaly Wool wrote:
> > Hi Minchan,
> >
> > On Mon, Oct 14, 2019 at 6:41 PM Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > >
> > > On Thu, Oct 10, 2019 at 11:04:14PM +0300, Vitaly Wool wrote:
> > > > The coming patchset is a new take on the old issue: ZRAM can currently be used only with zsmalloc even though this may not be the optimal combination for some configurations. The previous (unsuccessful) attempt dates back to 2015 [1] and is notable for the heated discussions it has caused.
> > > >
> > > > The patchset in [1] had basically the only goal of enabling ZRAM/zbud combo which had a very narrow use case. Things have changed substantially since then, and now, with z3fold used widely as a zswap backend, I, as the z3fold maintainer, am getting requests to re-interate on making it possible to use ZRAM with any zpool-compatible backend, first of all z3fold.
> > > >
> > > > The preliminary results for this work have been delivered at Linux Plumbers this year [2]. The talk at LPC, though having attracted limited interest, ended in a consensus to continue the work and pursue the goal of decoupling ZRAM from zsmalloc.
> > > >
> > > > The current patchset has been stress tested on arm64 and x86_64 devices, including the Dell laptop I'm writing this message on now, not to mention several QEmu confugirations.
> > > >
> > > > [1] https://lkml.org/lkml/2015/9/14/356
> > > > [2] https://linuxplumbersconf.org/event/4/contributions/551/
> > >
> > > Please describe what's the usecase in real world, what's the benefit zsmalloc
> > > cannot fulfill by desgin and how it's significant.
> >
> > I'm not entirely sure how to interpret the phrase "the benefit
> > zsmalloc cannot fulfill by design" but let me explain.
> > First, there are multi multi core systems where z3fold can provide
> > better throughput.
>
> Please include number in the description with workload.

Sure. So on an HMP 8-core ARM64 system with ZRAM, we run the following command:
fio --bs=4k --randrepeat=1 --randseed=100 --refill_buffers \
--buffer_compress_percentage=50 --scramble_buffers=1 \
--direct=1 --loops=15 --numjobs=4 --filename=/dev/block/zram0 \
--name=seq-write --rw=write --stonewall --name=seq-read \
--rw=read --stonewall --name=seq-readwrite --rw=rw --stonewall \
--name=rand-readwrite --rw=randrw --stonewall

The results are the following:

zsmalloc:
Run status group 0 (all jobs):
WRITE: io=61440MB, aggrb=1680.4MB/s, minb=430167KB/s,
maxb=440590KB/s, mint=35699msec, maxt=36564msec

Run status group 1 (all jobs):
READ: io=61440MB, aggrb=1620.4MB/s, minb=414817KB/s,
maxb=414850KB/s, mint=37914msec, maxt=37917msec

Run status group 2 (all jobs):
READ: io=30615MB, aggrb=897979KB/s, minb=224494KB/s,
maxb=228161KB/s, mint=34351msec, maxt=34912msec
WRITE: io=30825MB, aggrb=904110KB/s, minb=226027KB/s,
maxb=229718KB/s, mint=34351msec, maxt=34912msec

Run status group 3 (all jobs):
READ: io=30615MB, aggrb=772002KB/s, minb=193000KB/s,
maxb=193010KB/s, mint=40607msec, maxt=40609msec
WRITE: io=30825MB, aggrb=777273KB/s, minb=194318KB/s,
maxb=194327KB/s, mint=40607msec, maxt=40609msec

z3fold:
Run status group 0 (all jobs):
WRITE: io=61440MB, aggrb=1224.8MB/s, minb=313525KB/s,
maxb=329941KB/s, mint=47671msec, maxt=50167msec

Run status group 1 (all jobs):
READ: io=61440MB, aggrb=3119.3MB/s, minb=798529KB/s,
maxb=862883KB/s, mint=18228msec, maxt=19697msec

Run status group 2 (all jobs):
READ: io=30615MB, aggrb=937283KB/s, minb=234320KB/s,
maxb=234334KB/s, mint=33446msec, maxt=33448msec
WRITE: io=30825MB, aggrb=943682KB/s, minb=235920KB/s,
maxb=235934KB/s, mint=33446msec, maxt=33448msec

Run status group 3 (all jobs):
READ: io=30615MB, aggrb=829591KB/s, minb=207397KB/s,
maxb=210285KB/s, mint=37271msec, maxt=37790msec
WRITE: io=30825MB, aggrb=835255KB/s, minb=208813KB/s,
maxb=211721KB/s, mint=37271msec, maxt=37790msec

So, z3fold is faster everywhere (including being *two* times faster on
read) except for sequential write which is the least important use
case in real world.

> > Then, there are low end systems with hardware
> > compression/decompression support which don't need zsmalloc
> > sophistication and would rather use zbud with ZRAM because the
> > compression ratio is relatively low.
>
> I couldn't imagine how it's bad with zsmalloc. Could you be more
> specific?

> > Finally, there are MMU-less systems targeting IOT and still running
> > Linux and having a compressed RAM disk is something that would help
> > these systems operate in a better way (for the benefit of the overall
> > Linux ecosystem, if you care about that, of course; well, some people
> > do).
>
> Could you write down what's the problem to use zsmalloc for MMU-less
> system? Maybe, it would be important point rather other performance
> argument since other functions's overheads in the callpath are already
> rather big.

Well, I assume you had the reasons to make zsmalloc depend on MMU in Kconfig:
...
config ZSMALLOC
tristate "Memory allocator for compressed pages"
depends on MMU
help
...

But even disregarding that, let's compare ZRAM/zbud and ZRAM/zsmalloc
performance and memory these two consume on a relatively low end
2-core ARM.
Command:
fio --bs=4k --randrepeat=1 --randseed=100 --refill_buffers
--scramble_buffers=1 \
--direct=1 --loops=15 --numjobs=2 --filename=/dev/block/zram0 \
--name=seq-write --rw=write --stonewall --name=seq-read --rw=read \
--stonewall --name=seq-readwrite --rw=rw --stonewall
--name=rand-readwrite \
--rw=randrw --stonewall

zsmalloc:
Run status group 0 (all jobs):
WRITE: io=30720MB, aggrb=374763KB/s, minb=187381KB/s,
maxb=188389KB/s, mint=83490msec, maxt=83939msec

Run status group 1 (all jobs):
READ: io=30720MB, aggrb=964000KB/s, minb=482000KB/s,
maxb=482015KB/s, mint=32631msec, maxt=32632msec

Run status group 2 (all jobs):
READ: io=15308MB, aggrb=431263KB/s, minb=215631KB/s,
maxb=215898KB/s, mint=36302msec, maxt=36347msec
WRITE: io=15412MB, aggrb=434207KB/s, minb=217103KB/s,
maxb=217373KB/s, mint=36302msec, maxt=36347msec

Run status group 3 (all jobs):
READ: io=15308MB, aggrb=327328KB/s, minb=163664KB/s,
maxb=163667KB/s, mint=47887msec, maxt=47888msec
WRITE: io=15412MB, aggrb=329563KB/s, minb=164781KB/s,
maxb=164785KB/s, mint=47887msec, maxt=47888msec

zbud:
Run status group 0 (all jobs):
WRITE: io=30720MB, aggrb=735980KB/s, minb=367990KB/s,
maxb=373079KB/s, mint=42159msec, maxt=42742msec

Run status group 1 (all jobs):
READ: io=30720MB, aggrb=927915KB/s, minb=463957KB/s,
maxb=463999KB/s, mint=33898msec, maxt=33901msec

Run status group 2 (all jobs):
READ: io=15308MB, aggrb=403467KB/s, minb=201733KB/s,
maxb=202051KB/s, mint=38790msec, maxt=38851msec
WRITE: io=15412MB, aggrb=406222KB/s, minb=203111KB/s,
maxb=203430KB/s, mint=38790msec, maxt=38851msec

Run status group 3 (all jobs):
READ: io=15308MB, aggrb=334967KB/s, minb=167483KB/s,
maxb=167487KB/s, mint=46795msec, maxt=46796msec
WRITE: io=15412MB, aggrb=337254KB/s, minb=168627KB/s,
maxb=168630KB/s, mint=46795msec, maxt=46796msec

Pretty equal except for sequential write which is twice as good with zbud.

Now to the fun part.
zsmalloc:
0 .text 00002908 0000000000000000 0000000000000000 00000040 2**2
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
zbud:
0 .text 0000072c 0000000000000000 0000000000000000 00000040 2**2
CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE

And this does not cover dynamic memory allocation overhead which is
higher for zsmalloc. So once again, given that the compression ratio
is low (e. g. a simple HW accelerator is used), what would most
unbiased people prefer to use in this case?

> > > I really don't want to make fragmentaion of allocator so we should really see
> > > how zsmalloc cannot achieve things if you are claiming.
> >
> > I have to say that this point is completely bogus. We do not create
> > fragmentation by using a better defined and standardized API. In fact,
> > we aim to increase the number of use cases and test coverage for ZRAM.
> > With that said, I have hard time seeing how zsmalloc can operate on a
> > MMU-less system.
> >
> > > Please tell us how to test it so that we could investigate what's the root
> > > cause.
> >
> > I gather you haven't read neither the LPC documents nor my
> > conversation with Sergey re: these changes, because if you did you
> > wouldn't have had the type of questions you're asking. Please also see
> > above.
>
> Please include your claims in the description rather than attaching
> file. That's the usualy way how we work because it could make easier to
> discuss by inline.

Did I attach something? I don't quite recall that. I posted links to
previous discussions and conference materials, each for a reason.

> >
> > I feel a bit awkward explaining basic things to you but there may not
> > be other "root cause" than applicability issue. zsmalloc is a great
> > allocator but it's not universal and has its limitations. The
> > (potential) scope for ZRAM is wider than zsmalloc can provide. We are
> > *helping* _you_ to extend this scope "in real world" (c) and you come
> > up with bogus objections. Why?
>
> Please add more detail to convince so we need to think over why zsmalloc
> cannot be improved for the usecase.

This approach is wrong. zsmalloc is good enough and covers a lot of
use cases but there are still some where it doesn't work that well by
design. E. g. on an XIP system we do care about the code size since
it's stored uncompressed but still want to use ZRAM. Why would we want
to waste almost 10K just on zsmalloc code if the counterpart (zbud in
that case) works better?

Best regards,
Vitaly

Next message: David Howells: "Re: [PATCH] security/keyring: avoid pagefaults in keyring_read_iterator"
Previous message: Oleg Nesterov: "cgroup_enable_task_cg_lists && PF_EXITING (Was: KCSAN: data-race in exit_signals / prepare_signal)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]