Re: [PATCH] lib/crypto: blake2b: Limit frame size workaround to GCC < 12.2 on i386

From: Eric Biggers
Date: Mon Nov 24 2025 - 17:40:36 EST


On Mon, Nov 24, 2025 at 06:14:31PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 24, 2025 at 10:08 AM david laight <david.laight@xxxxxxxxxx> wrote:
> > > How about we roll up the BLAKE2b rounds loop if !CONFIG_64BIT?
> >
> > I do wonder about the real benefit of some of the massive loop unrolling
> > that happens in a lot of these algorithms (not just blake2b).
>
> I remember looking at this in the context of blake2s, with two paths,
> depending on CONFIG_CC_OPTIMIZE_FOR_SIZE, but the savings didn't seem
> enough for the performance hit. It might be platform specific though.
> I guess try it and post numbers, and that'll either be a compelling
> reason to adjust it or still "meh"?

Earlier I did some quick microbenchmarks with blake2b_kunit. The
existing unrolling does increase throughput by as much as 50%. That's
probably mostly because unrolling lets the compiler fold the
blake2b_sigma lookups into constant message word indices.
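
To make the sigma point concrete, here is a rough userspace sketch (not
the lib/crypto code; the function names and the "rolled" parameter are
made up for illustration) of one BLAKE2b compression with the rounds
either unrolled or in a loop. In the unrolled form every
blake2b_sigma[r][k] has a constant r, so each message word access can
become a fixed offset. In the loop form the table has to be read at run
time. A compiler may still unroll the loop on its own at higher
optimization levels.

```c
#include <stdint.h>
#include <string.h>

static const uint64_t blake2b_iv[8] = {
	0x6a09e667f3bcc908ULL, 0xbb67ae8584caa73bULL,
	0x3c6ef372fe94f82bULL, 0xa54ff53a5f1d36f1ULL,
	0x510e527fade682d1ULL, 0x9b05688c2b3e6c1fULL,
	0x1f83d9abfb41bd6bULL, 0x5be0cd19137e2179ULL,
};

static const uint8_t blake2b_sigma[12][16] = {
	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
	{ 11,  8, 12,  0,  5,  2, 15, 13, 10, 14,  3,  6,  7,  1,  9,  4 },
	{  7,  9,  3,  1, 13, 12, 11, 14,  2,  6,  5, 10,  4,  0, 15,  8 },
	{  9,  0,  5,  7,  2,  4, 10, 15, 14,  1, 11, 12,  6,  8,  3, 13 },
	{  2, 12,  6, 10,  0, 11,  8,  3,  4, 13,  7,  5, 15, 14,  1,  9 },
	{ 12,  5,  1, 15, 14, 13,  4, 10,  0,  7,  6,  3,  9,  2,  8, 11 },
	{ 13, 11,  7, 14, 12,  1,  3,  9,  5,  0, 15,  4,  8,  6,  2, 10 },
	{  6, 15, 14,  9, 11,  3,  0,  8, 12,  2, 13,  7,  1,  4, 10,  5 },
	{ 10,  2,  8,  4,  7,  6,  1,  5, 15, 11,  9, 14,  3, 12, 13,  0 },
	{  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15 },
	{ 14, 10,  4,  8,  9, 15, 13,  6,  1, 12,  0,  2, 11,  7,  5,  3 },
};

static inline uint64_t ror64(uint64_t x, unsigned int n)
{
	return (x >> n) | (x << (64 - n));
}

/* Mixing function; r is the round, k selects the pair of sigma entries. */
#define G(r, k, a, b, c, d) do {			\
	a += b + m[blake2b_sigma[r][2 * (k) + 0]];	\
	d = ror64(d ^ a, 32);				\
	c += d;						\
	b = ror64(b ^ c, 24);				\
	a += b + m[blake2b_sigma[r][2 * (k) + 1]];	\
	d = ror64(d ^ a, 16);				\
	c += d;						\
	b = ror64(b ^ c, 63);				\
} while (0)

/* One full round: four column steps, then four diagonal steps. */
#define ROUND(r) do {					\
	G(r, 0, v[0], v[4], v[ 8], v[12]);		\
	G(r, 1, v[1], v[5], v[ 9], v[13]);		\
	G(r, 2, v[2], v[6], v[10], v[14]);		\
	G(r, 3, v[3], v[7], v[11], v[15]);		\
	G(r, 4, v[0], v[5], v[10], v[15]);		\
	G(r, 5, v[1], v[6], v[11], v[12]);		\
	G(r, 6, v[2], v[7], v[ 8], v[13]);		\
	G(r, 7, v[3], v[4], v[ 9], v[14]);		\
} while (0)

/* Compress one block; t is the byte count so far (low 64 bits only). */
static void blake2b_compress_sketch(uint64_t h[8], const uint64_t m[16],
				    uint64_t t, int last, int rolled)
{
	uint64_t v[16];
	int r, i;

	memcpy(v, h, 8 * sizeof(v[0]));
	memcpy(v + 8, blake2b_iv, sizeof(blake2b_iv));
	v[12] ^= t;
	if (last)
		v[14] = ~v[14];

	if (rolled) {
		for (r = 0; r < 12; r++)	/* sigma row read at run time */
			ROUND(r);
	} else {
		ROUND(0); ROUND(1); ROUND(2); ROUND(3);
		ROUND(4); ROUND(5); ROUND(6); ROUND(7);
		ROUND(8); ROUND(9); ROUND(10); ROUND(11);
	}

	for (i = 0; i < 8; i++)
		h[i] ^= v[i] ^ v[i + 8];
}

/* First state word of unkeyed BLAKE2b-512("abc"), for a sanity check. */
static uint64_t blake2b_abc_h0(int rolled)
{
	uint64_t h[8], m[16] = { 0x636261 };	/* "abc", little endian */

	memcpy(h, blake2b_iv, sizeof(h));
	h[0] ^= 0x01010040;	/* depth 1, fanout 1, no key, 64-byte digest */
	blake2b_compress_sketch(h, m, 3, 1, rolled);
	return h[0];
}
```

Both blake2b_abc_h0(0) and blake2b_abc_h0(1) should give the same value,
matching the start of the "abc" test vector in RFC 7693. Comparing the
object code size of the two branches at -O2 gives a rough idea of the
trade-off, though that is no substitute for measuring the real kernel
code.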

However, the increased code size is a real issue that doesn't show up in
that microbenchmark. Naturally, it will be especially bad on 32-bit
CPUs, given that BLAKE2b works with 64-bit words. The 32-bit code gets
the code size blow-up from emulating the 64-bit arithmetic using 32-bit
instructions, in addition to the unrolling. Rolling up the rounds loop
when !CONFIG_64BIT seems like a reasonable first step.
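
For a feel of the 64-bit emulation cost, here is a hand-written sketch
of what a 32-bit target roughly has to do for the 64-bit add and rotate
in each mixing step. The struct and helper names are hypothetical, and
real compiler output will differ, but each 64-bit operation turns into
several 32-bit ones plus extra register pressure.

```c
#include <stdint.h>

/* Hypothetical representation of a 64-bit word as two 32-bit halves. */
struct u64_halves {
	uint32_t lo, hi;
};

static struct u64_halves split64(uint64_t x)
{
	struct u64_halves r = { (uint32_t)x, (uint32_t)(x >> 32) };

	return r;
}

static uint64_t join64(struct u64_halves x)
{
	return ((uint64_t)x.hi << 32) | x.lo;
}

/* 64-bit add: two 32-bit adds plus carry propagation. */
static struct u64_halves add64_32(struct u64_halves a, struct u64_halves b)
{
	struct u64_halves r;

	r.lo = a.lo + b.lo;
	r.hi = a.hi + b.hi + (r.lo < a.lo);	/* carry out of the low half */
	return r;
}

/* 64-bit rotate right by n, valid for 0 < n < 32: four shifts, two ORs. */
static struct u64_halves ror64_32(struct u64_halves x, unsigned int n)
{
	struct u64_halves r;

	r.lo = (x.lo >> n) | (x.hi << (32 - n));
	r.hi = (x.hi >> n) | (x.lo << (32 - n));
	return r;
}
```

The rotate by 32 is just a swap of the halves, and the rotate by 63 can
be done as a swap followed by ror64_32(x, 31). Even so, every one of the
96 mixing steps in a fully unrolled compression function gets
correspondingly larger.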

We could consider rolling up the rounds loop even when CONFIG_64BIT. If
optimal BLAKE2b throughput were actually important on x86_64, we should
have an AVX-optimized implementation anyway. But no one has ever cared
to add one. I think btrfs is the only user currently, but btrfs's use
case is non-cryptographic and it already supports much faster
non-cryptographic checksums (crc32c and xxhash64).

- Eric