Re: [PATCH 0/3] ARM ZSTD boot compression

From: Nick Terrell
Date: Thu Oct 12 2023 - 18:33:33 EST




> On Apr 14, 2023, at 10:00 PM, Jonathan Neuschäfer <j.neuschaefer@xxxxxxx> wrote:
>
> On Thu, Apr 13, 2023 at 01:13:21PM +0200, Arnd Bergmann wrote:
>> On Wed, Apr 12, 2023, at 23:33, Arnd Bergmann wrote:
>>> On Wed, Apr 12, 2023, at 23:21, Jonathan Neuschäfer wrote:
>>>> This patchset enables ZSTD kernel (de)compression on 32-bit ARM.
>>>> Unfortunately, it is much slower than I hoped (tested on ARM926EJ-S):
>>>>
>>>> - LZO: 7.2 MiB, 6 seconds
>>>> - ZSTD: 5.6 MiB, 60 seconds
>>>
>>> That seems unexpected, as the usual numbers say it's about 25%
>>> slower than LZO. Do you have an idea why it is so much slower
>>> here? How long does it take to decompress the
>>> generated arch/arm/boot/Image file in user space on the same
>>> hardware using lzop and zstd?
>>
>> I looked through this a bit more and found two interesting points:
>>
>> - zstd uses a lot more unaligned loads and stores while
>> decompressing. On armv5 those turn into individual byte
>> accesses, while the others can likely use word-aligned
>> accesses. This could make a huge difference if caches are
>> disabled during the decompression.
>>
>> - The sliding window on zstd is much larger, with the kernel
>> using an 8 MB window (wlog=23), compared to the normal 32 KB
>> for deflate (couldn't find the default for lzo), so on
>> machines with no L2 cache, it is much more likely to thrash
>> the small L1 dcache used on most arm9 cores.
>>
>> Arnd
>
> Makes sense.
>
> For ZSTD as used in kernel decompression (the zstd22 configuration), the
> window is even bigger, 128 MiB. (AFAIU)

Sorry, I’m a bit late to the party; I wasn’t getting LKML email for some time...

But this is totally configurable; you can switch compression settings at
any time. If you believe the window size is what is causing the speed
regression, you could tell zstd to use e.g. a 256 KB window like this:

zstd -19 --zstd=wlog=18

This will keep the same search strength during compression, but limit the
decoder's memory usage (and therefore its working set) during decompression.
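
For reference, the window size is just 1 << wlog, so the sizes being
discussed in this thread look like this (purely illustrative C, not
kernel code):

#include <stdio.h>

/* Illustrative only: map a zstd window log (wlog) to the window size in
 * bytes. The decoder keeps roughly this much history in memory, which is
 * what can thrash a small L1 dcache on cache-poor cores.
 */
static unsigned long long window_bytes(unsigned wlog)
{
	return 1ULL << wlog;
}

int main(void)
{
	printf("wlog=18 -> %llu KiB\n", window_bytes(18) >> 10); /* 256 KiB */
	printf("wlog=23 -> %llu MiB\n", window_bytes(23) >> 20); /* 8 MiB   */
	printf("wlog=27 -> %llu MiB\n", window_bytes(27) >> 20); /* 128 MiB */
	return 0;
}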

I will also try to get this patchset working on my machine and debug it.
A 10x slowdown is not expected; we see much better speeds in ARM
userspace, so I suspect it has something to do with the preboot
environment. E.g. when implementing x86-64 zstd kernel decompression, I
noticed that memcpy(dst, src, 16) wasn’t getting inlined properly,
causing a massive performance penalty.
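
To illustrate what I mean (this is a simplified sketch, not the actual
zstd or kernel source), the hot paths in the decoder rely on patterns
like the helpers below. If the preboot environment only provides an
out-of-line byte-at-a-time memcpy and the compiler doesn't inline the
fixed-size calls, every 16-byte copy and every unaligned read becomes a
function call, which would easily explain a large slowdown:

#include <stdint.h>

/* Simplified sketch of zstd-style helpers (not the real source). The
 * fixed-size memcpy is meant to compile down to a couple of loads and
 * stores; if it instead becomes a call into a byte-copy loop, the
 * match-copy loop gets very slow.
 */
static inline void copy16(void *dst, const void *src)
{
	__builtin_memcpy(dst, src, 16);
}

/* Portable unaligned 32-bit read. On CPUs with efficient unaligned
 * access this is a single load; on armv5 the compiler has to emit
 * individual byte loads, which is part of why the decoder is slower
 * there (as Arnd pointed out above).
 */
static inline uint32_t read32(const void *ptr)
{
	uint32_t val;
	__builtin_memcpy(&val, ptr, sizeof(val));
	return val;
}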

Best,
Nick Terrell

> Thanks
>
> Jonathan