Re: [PATCH v3 3/8] lib: add zstd support to decompress
From: Nick Terrell
Date: Tue Mar 31 2020 - 22:46:58 EST
> On Mar 26, 2020, at 1:16 PM, Petr Malat <oss@xxxxxxxxx> wrote:
>
> Hi!
> On Thu, Mar 26, 2020 at 07:03:54PM +0000, Nick Terrell wrote:
>>>> * Add unzstd() and the zstd decompress interface.
>>> Here I do not understand why you limit the window size to 8 MB even when
>>> you read a larger value from the header. I do not see a reason why there
>>> should be such a limitation in the first place, and if there should be,
>>> why it differs from ZSTD_WINDOWLOG_MAX.
>>
>> When we are doing streaming decompression (either flush or fill is provided)
>> we have to allocate memory proportional to the window size. We want to
>> bound that memory so we don't accidentally allocate too much memory.
>> When we are doing a single-pass decompression (neither flush nor fill
>> are provided) the window size doesn't matter, and we only have to allocate
>> a fixed amount of memory ~192 KB.
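
Roughly, the memory trade-off looks like this (a sketch only, using the kernel
zstd API names of that era; the exact helpers used in the patch differ):

#include <linux/types.h>
#include <linux/zstd.h>

/* Sketch: how much workspace each decompression mode needs. */
static size_t zstd_workspace_bound(bool streaming, size_t window_size)
{
	if (streaming)
		/* Streaming: the bound grows with the frame's window size. */
		return ZSTD_DStreamWorkspaceBound(window_size);
	/* Single pass: a fixed bound (~192 KB); the window size is irrelevant. */
	return ZSTD_DCtxWorkspaceBound();
}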
>>
>> The zstd spec [0] specifies that all decoders should allow window sizes
>> up to 8 MB. Additionally, the zstd CLI won't produce window sizes greater
>> than 8 MB by default. The window size is controlled by the compression
>> level, and can be explicitly set.
> Yes, one needs to pass the --ultra option to zstd to produce an incompatible
> archive, but that doesn't justify limiting this in the kernel,
> especially if one is able to read the needed window size from the header
> when allocating the memory. At the time the initramfs is extracted,
> memory is usually available, as it happens before any processes are
> started, and this memory is reclaimed after the decompression.
>
> If, on the other hand, a user makes an initramfs for a memory-constrained
> system, he limits the window size while compressing the archive, and
> the small window size will be announced in the header.
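
Just to illustrate (a userspace sketch with libzstd's advanced API, not part of
the kernel patch): an initramfs builder can pin the window explicitly, and the
chosen size is then announced in the frame header:

#include <zstd.h>	/* userspace libzstd >= 1.4.0 */

/* Sketch: compress at level 19 but cap the window at 1 MB (windowLog = 20),
 * as a memory-constrained target would want. Error handling omitted. */
static size_t compress_with_small_window(void *dst, size_t dst_cap,
					 const void *src, size_t src_size)
{
	ZSTD_CCtx *cctx = ZSTD_createCCtx();
	size_t ret;

	ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, 19);
	ZSTD_CCtx_setParameter(cctx, ZSTD_c_windowLog, 20);
	ret = ZSTD_compress2(cctx, dst, dst_cap, src, src_size);
	ZSTD_freeCCtx(cctx);
	return ret;
}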
>
> The only scenario where using the hard-coded limit makes sense is the
> case where the window size is not available (I'm not sure if it's mandatory
> to provide it). That's how my code works: if the size is available,
> it uses the provided value; if not, it uses 1 << ZSTD_WINDOWLOG_MAX.
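
If I understand correctly, that policy is roughly the following (a sketch using
the kernel zstd API names of that era, not your actual code; pick_window_size()
is a made-up name):

#include <linux/zstd.h>

/* Sketch: trust the frame header's window size when present, otherwise
 * fall back to the format's maximum. Error handling omitted. */
static size_t pick_window_size(const void *src, size_t src_size)
{
	ZSTD_frameParams params;

	if (ZSTD_getFrameParams(&params, src, src_size) == 0 && params.windowSize)
		return params.windowSize;
	return (size_t)1 << ZSTD_WINDOWLOG_MAX;
}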
>
> I would also agree a fixed limit would make sense if user- (or network-)
> provided data were being processed, but in this case only the system owner is
> able to provide an initramfs. If one is able to change the initramfs, he can
> render the system unusable simply by providing a corrupted file. He doesn't
> have to bother making the window bigger than the available memory.
>
>> I would expect larger window sizes to be beneficial for compression ratio,
>> though there are diminishing returns. I would expect that for kernel image
>> compression larger window sizes are beneficial, since it is decompressed
>> in a single pass. For initramfs decompression, I would expect that limiting
>> the window size could help decompression speed, since it uses streaming
>> decompression, so unzstd() has to allocate a buffer of window-size bytes.
> Yes, a larger window improves the compression ratio; here is a comparison
> between levels 19 and 22 on my test x86-64 initramfs:
> 30775022 rootfs.cpio.zst-19
> 28755429 rootfs.cpio.zst-22
> These 7% can be noticeable when one has slow storage, e.g. a flash memory
> on an SPI bus.
>
>>> I removed that limitation to be able to test it in my environment, and I
>>> found the performance is worse than with my patch by roughly 20% (on an
>>> i7-3520M), which is a major drawback considering the main motivation
>>> to use zstd is the decompression speed. I will test on ARM as well and
>>> share the result tomorrow.
>>> Petr
>>
>> What do you mean by that? Can you share with me the test you ran?
>> Is this for kernel decompression or initramfs decompression?
> Initramfs - you can apply my v2 patch on v5.5 and try with your test data.
>
> I have also tested your patch on an ARMv7 platform, and there the degradation
> was 8%.
Thanks again for measuring the speed differences between the two patchsets!

I've found that the difference in performance between our two patchsets is
caused by the output buffer size. I expect this is due to calling flush() more
often, since flush() drives a complex state machine in the initramfs use case.
I've measured the speed of this patch set (v3), compared against this patch set
with a 128 KB buffer size (ZSTD_DStreamOutSize()), vs Petr's patchset. I'm
measuring on an Intel i9-9900K with turbo disabled on CPU 0. I'm booting the
kernel using QEMU.

To measure the initramfs decompression speed I look at the difference in
timestamps between "Unpacking initramfs..." and "Freeing initrd memory". The
initramfs is compressed using level 19, but results for level 22 are similar.
Times are reported in seconds. I ran each test 3 times and took the median time,
but the results are very stable. On ARM the initramfs is 26 MB. On x86-64 the
initramfs is 97 MB.

Arch     v3      128     Petr
ARM      1.67    1.52    1.55
x86-64   1.76    1.69    1.66

With the 128 KB buffer my results are slightly better on ARM, and yours are
slightly better on x86-64.

In v4 of my patchset, which I will send out tonight, I will increase ZSTD_IOBUF_SIZE
to 128 KB (as well as remove the 8 MB window size limit). Please let me know if your
results align with mine on v4.
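
For concreteness, the buffer-size change amounts to just bumping the constant,
something like this (a guess at the shape of the v4 change, not the actual diff):

/* Use a full 128 KB zstd block as the fill()/flush() chunk size. */
#define ZSTD_IOBUF_SIZE		(128 * 1024)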
I've also measured the x86_64 zstd kernel decompression speed using our two
patchsets. I measured it by timing between the "Decompressing Linux..." message
and the "Parsing ELF" message with this script [0]. I used the same measurement
technique as above. The kernel I am testing is compressed at level 19 with my
patchset, and at level 19 with a window size of 4 MB with your patchset.

I found that my patchset takes 70 ms to decompress and yours takes 318 ms. Your
patchset also uses 4 MB of heap memory, whereas mine only needs 192 KB. The
difference is caused by two things:
1. memcpy() is replaced by __builtin_memcpy() in patch 1 of my set. This is the
core of the decompression hot loop, and without it the compiler can't inline
memcpy() (see the first sketch below).
2. My patchset calls decompress_single() when neither flush nor fill is provided,
as when decompressing the kernel (see the second sketch below). This saves the
4 MB of memory, as well as speeding up decompression a little bit.
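
For point 1, the change inside lib/zstd is essentially of this shape (a sketch
of the idea, not the actual diff; the 8-byte copy helper shown here is one of
the hot-loop copies in zstd's internals):

/* Let the compiler inline the hot-loop copy instead of calling memcpy(). */
static void ZSTD_copy8(void *dst, const void *src)
{
	__builtin_memcpy(dst, src, 8);	/* was: memcpy(dst, src, 8); */
}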
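
Point 2 boils down to a dispatch like the following in unzstd() (a sketch
following the kernel's generic decompress_fn signature; argument lists are
approximate and decompress_stream() is a made-up name here):

/* Pick single-pass or streaming decompression based on whether the caller
 * supplied fill()/flush() callbacks. */
static int unzstd_dispatch(unsigned char *in_buf, long in_len,
			   long (*fill)(void *, unsigned long),
			   long (*flush)(void *, unsigned long),
			   unsigned char *out_buf, long *pos,
			   void (*error)(char *x))
{
	if (!fill && !flush)
		/* Kernel image case: everything is in memory, so a one-shot
		 * decode with a fixed ~192 KB workspace is enough. */
		return decompress_single(in_buf, in_len, out_buf, pos, error);

	/* Initramfs case: stream through fill()/flush(), which needs a
	 * window-sized output buffer allocated up front. */
	return decompress_stream(in_buf, in_len, fill, flush, out_buf, pos,
				 error);
}
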
Best,
Nick Terrell
[0] https://gist.github.com/terrelln/9bd53321a669f62683c608af8944fbc2