Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

From: Eric Biggers
Date: Tue Mar 26 2024 - 12:49:13 EST


On Tue, Mar 26, 2024 at 10:51:48AM +0200, Ard Biesheuvel wrote:
> > Open questions:
> >
> > - Is the policy that I implemented for preferring ymm registers to zmm
> > registers the right one? arch/x86/crypto/poly1305_glue.c thinks that
> > only Skylake has the bad downclocking. My current proposal is a bit
> > more conservative; it also excludes Ice Lake and Tiger Lake. Those
> > CPUs supposedly still have some downclocking, though not as much.
> >
> > - Should the policy on the use of zmm registers be in a centralized
> > place? It probably doesn't make sense to have random different
> > policies for different crypto algorithms (AES, Poly1305, ARIA, etc.).
> >
> > - Are there any other known issues with using AVX512 in kernel mode? It
> > seems to work, and technically it's not new because Poly1305 and ARIA
> > already use AVX512, including the mask registers and zmm registers up
> > to 31. So if there was a major issue, like the new registers not
> > being properly saved and restored, it probably would have already been
> > found. But AES-XTS support would introduce a wider use of it.
> >
>
> I don't have much input here, except that I think we should just
> disable AVX512 kernel-wide on systems where there is no benefit in
> terms of throughput. I suspect this might change with algorithms that
> rely more heavily on the masking, but so far, we have been making
> quite effective use of simple permute vectors and overlapping loads
> and stores to do the same. And as Eric points out, the only relevant
> use case in the kernel is blocks of size 2^n where n is at least 9.

There are several benefits to AVX512 besides the 512-bit zmm registers. Besides
masking, there are also twice as many SIMD registers (32 instead of 16), which
makes it possible to cache all the AES round keys in registers. There are also
new instructions such as vpternlogd, which I've used in AES-XTS to XOR values
together more efficiently.

That's why this patchset adds both xts-aes-vaes-avx10_256 and
xts-aes-vaes-avx10_512. And I've adopted the new "AVX10" naming, maybe a bit
early, to emphasize that it's not just about 512-bit...
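
To make the vpternlogd point concrete: the instruction applies an arbitrary
three-input boolean function, selected by an 8-bit immediate, so a three-way
XOR takes one instruction instead of two. The following is only a scalar model
of a single 32-bit lane, written for illustration; the names ternlog32,
TERNLOG_XOR3, and ternlog_xor3_selfcheck are hypothetical and do not come from
the patchset.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scalar model of one 32-bit lane of vpternlogd (illustrative only).
 * For each bit position, the three input bits form an index (a is the
 * most significant bit, c the least significant), and the result bit
 * is bit number 'idx' of imm8.
 */
static uint32_t ternlog32(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
	uint32_t out = 0;

	for (int bit = 0; bit < 32; bit++) {
		unsigned int idx = (((a >> bit) & 1) << 2) |
				   (((b >> bit) & 1) << 1) |
				   ((c >> bit) & 1);

		out |= (uint32_t)((imm8 >> idx) & 1) << bit;
	}
	return out;
}

/* 0x96 has bits set at the odd-parity indices 1, 2, 4, 7: a ^ b ^ c. */
#define TERNLOG_XOR3 0x96

/* Usage: one ternary-logic operation replaces two XORs. */
static void ternlog_xor3_selfcheck(void)
{
	uint32_t a = 0xdeadbeef, b = 0x01234567, c = 0x89abcdef;

	assert(ternlog32(a, b, c, TERNLOG_XOR3) == (a ^ b ^ c));
}
```

In real code the same immediate would be passed to the instruction itself (or
to a compiler intrinsic), operating on a whole vector register at once.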

Consider Intel Ice Lake, for example. These are the AES-256-XTS encryption
speeds on 4096-byte messages, in MB/s, that I'm seeing:

xts-aes-aesni             5136
xts-aes-aesni-avx         5366
xts-aes-vaes-avx2         9337
xts-aes-vaes-avx10_256    9876
xts-aes-vaes-avx10_512   10215

So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
But taking advantage of AVX512 does help a bit more: first from the features
other than the 512-bit registers, then a bit more from the 512-bit registers.
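
The relative size of each step can be worked out directly from the numbers
above. The helper below is a small sketch for that arithmetic only;
percent_gain is a hypothetical name, and the inputs are the MB/s figures from
the table.

```c
#include <assert.h>

/*
 * Percentage throughput gain of 'new_mbps' over 'base_mbps'.
 * Illustrative helper; both arguments are throughputs in MB/s.
 */
static double percent_gain(double base_mbps, double new_mbps)
{
	assert(base_mbps > 0.0);
	return (new_mbps / base_mbps - 1.0) * 100.0;
}
```

For example, percent_gain(5136, 9337) compares xts-aes-aesni with
xts-aes-vaes-avx2 and comes out at roughly 82%, while
percent_gain(9337, 9876) and percent_gain(9876, 10215) give roughly 6% and 3%
for the two AVX512 steps.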

I do have Ice Lake on the exclusion list from xts-aes-vaes-avx10_512 anyway,
since the concern with downclocking is not really about the performance of the
code itself but rather the impact on unrelated code running on the CPU.

And I *think* the right policy is to just disable the use of the zmm registers,
as opposed to disabling AVX512 entirely. As originally presented, AVX512 did tie
the two together, but they don't have to be tied. AVX10 (which supposedly future
x86_64 CPUs will have) explicitly moves away from that by repackaging the
existing AVX512 features and making the zmm registers optional.
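
A rough sketch of what "disable zmm, not AVX512" could look like as a selection
policy is given below. This is not the patchset's code: the enum values, struct
cpu_caps, prefer_ymm, and pick_xts_impl are all hypothetical, the single
'avx512' flag stands in for whatever set of AVX512 features the real code would
check, and the exclusion list simply mirrors the CPUs named in this thread
(Skylake, Ice Lake, Tiger Lake).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical CPU classes; not the kernel's real model identifiers. */
enum cpu_class { CPU_SKYLAKE, CPU_ICE_LAKE, CPU_TIGER_LAKE, CPU_OTHER };

enum xts_impl {
	XTS_AESNI,
	XTS_AESNI_AVX,
	XTS_VAES_AVX2,
	XTS_VAES_AVX10_256,
	XTS_VAES_AVX10_512,
};

/* Simplified feature flags (assumption: one flag per feature group). */
struct cpu_caps {
	bool avx;
	bool avx2;
	bool vaes;
	bool avx512;
	enum cpu_class cls;
};

/* CPUs where zmm use is avoided because of downclocking concerns. */
static bool prefer_ymm(enum cpu_class cls)
{
	return cls == CPU_SKYLAKE || cls == CPU_ICE_LAKE ||
	       cls == CPU_TIGER_LAKE;
}

static enum xts_impl pick_xts_impl(const struct cpu_caps *caps)
{
	assert(caps != NULL);

	if (caps->vaes && caps->avx512) {
		/* Keep the other AVX512 features; only drop the zmm registers. */
		return prefer_ymm(caps->cls) ? XTS_VAES_AVX10_256
					     : XTS_VAES_AVX10_512;
	}
	if (caps->vaes && caps->avx2)
		return XTS_VAES_AVX2;
	if (caps->avx)
		return XTS_AESNI_AVX;
	return XTS_AESNI;
}
```

Under this sketch, an excluded CPU with VAES and AVX512 gets the 256-bit AVX10
variant rather than falling all the way back to AVX2, which is the distinction
being argued for above.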

- Eric