Re: [PATCH] crypto: ecc - Optimize vli additive operations using compiler builtins

From: Fabian

Date: Tue Jun 23 2026 - 15:10:56 EST

On Tue, 23 Jun 2026 at 15:37, Lukas Wunner <lukas@xxxxxxxxx> wrote:

> The kernel is much less encumbered, the minimum compiler versions are
> apparent from Documentation/process/changes.rst. If these compiler
> versions support the builtins you're using then everything should be
> alright.

Yes, I have tested this with the minimum required versions of both
clang and gcc.

> > This is quite interesting, since, as far as I know, the kernel compiles
> > with gcc and O2 by default, yet the macro-level benchmarks still show a
> > performance increase. The effect seems to be reversed when crypto/ecc.c
> > gets compiled. Or maybe the linux kernel uses some additional
> > optimization flags, I am unsure.
>
> You can compile the kernel with V=1 to see the full command line.

I did this, and the kernel does not appear to add any optimization flags
that would affect my code (other than -O2, of course).

I have looked at the generated assembly again and found the reason my
code is faster in the kernel but not in the micro-benchmarks at gcc -O2.
The carry chains generated for the vli functions in my patched code are
identical in the kernel and in the micro-benchmarks. However, for some
reason, the original vli carry chains generate an extra cmov instruction
in the kernel but not in the micro-benchmarks, which adds a dependency
chain for each limb. The original ecc.o object file is also ~10% bigger,
which may be another factor.
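
For readers following along, the two carry-chain shapes being compared
can be sketched in userspace roughly as below. This is an illustration
under stated assumptions, not the patch itself: the names vli_add_cmp,
vli_add_builtin and vli_add_variants_agree are hypothetical, the first
loop is only modeled on a comparison-based carry, and the choice of
__builtin_add_overflow is an assumption, since the thread does not say
which builtins the patch uses.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Comparison-based carry: the carry is recovered from the sum. */
static uint64_t vli_add_cmp(uint64_t *result, const uint64_t *left,
			    const uint64_t *right, unsigned int ndigits)
{
	uint64_t carry = 0;
	unsigned int i;

	for (i = 0; i < ndigits; i++) {
		uint64_t sum = left[i] + right[i] + carry;

		/*
		 * sum == left[i] means right[i] + carry wrapped to 0,
		 * so the incoming carry is kept unchanged. A compiler
		 * may lower this conditional update to a cmov.
		 */
		if (sum != left[i])
			carry = (sum < left[i]);
		result[i] = sum;
	}
	return carry;
}

/* Builtin-based carry (assumed builtin: __builtin_add_overflow). */
static uint64_t vli_add_builtin(uint64_t *result, const uint64_t *left,
				const uint64_t *right,
				unsigned int ndigits)
{
	uint64_t carry = 0;
	unsigned int i;

	for (i = 0; i < ndigits; i++) {
		uint64_t sum;
		bool c1 = __builtin_add_overflow(left[i], right[i], &sum);
		bool c2 = __builtin_add_overflow(sum, carry, &sum);

		result[i] = sum;
		/* At most one of the two additions can overflow. */
		carry = c1 | c2;
	}
	return carry;
}

/* Returns 1 if both variants agree on an input that carries through
 * every limb (all-ones plus one). */
static int vli_add_variants_agree(void)
{
	const uint64_t a[4] = { UINT64_MAX, UINT64_MAX,
				UINT64_MAX, UINT64_MAX };
	const uint64_t b[4] = { 1, 0, 0, 0 };
	uint64_t r1[4], r2[4];
	uint64_t c1 = vli_add_cmp(r1, a, b, 4);
	uint64_t c2 = vli_add_builtin(r2, a, b, 4);

	return c1 == 1 && c2 == 1 &&
	       memcmp(r1, r2, sizeof(r1)) == 0 &&
	       r1[0] == 0 && r1[3] == 0;
}
```

Compiling both functions with the same compiler and flags as the kernel
build (as shown by V=1) and comparing the disassembly of the two loops
is one way to check whether the extra cmov shows up in a given setup.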

>
> > However, most of the time, the patched version outperforms the original
> > one by a wide margin:
> > - On clang -O2 or -O3, vli_add and vli_uadd show a 4.074x and 5.384x
> > speedup.
> > - On gcc, vli_uadd shows a 74% performance increase at O2,
> > and a 2.07x speedup at O3.
>
> There is precedent in the tree for overriding the default -O2 with -O3,
> see lib/lz4/Makefile and arch/mips/vdso/Makefile.
>
> It might be worth using that for crypto/ecc.c if it doesn't cause
> breakage and yields a significant speedup.
>

I agree. The gcc -O2 code is still ~3.5x slower than the gcc -O3 version
from the micro-benchmarks. What code gcc -O3 will actually generate in
the kernel is another matter, but I'm pretty certain it would further
improve performance for the vli operations.

However, how this would affect binary size and speed, or whether it
would cause breakage in other areas, is a separate question.

> Previously we discussed replacing the ECC point multiplication algorithm
> used by crypto/ecc.c with a newer constant time Montgomery ladder.
> If you are interested in continuing working on crypto/ecc.c,
> this might be a worthwhile topic:
>
> https://lore.kernel.org/r/aftFAexDFrYbIeBM@xxxxxxxxx/
>

I will certainly look into this in the future, but it would take
considerably more time.

Thanks for the reply,

Fabian