RE: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs

From: David Laight
Date: Thu Apr 04 2024 - 03:54:30 EST


From: Eric Biggers
> Sent: 04 April 2024 02:35
>
> Hi David,
>
> On Wed, Apr 03, 2024 at 08:12:09AM +0000, David Laight wrote:
> > From: Eric Biggers
> > > Sent: 26 March 2024 16:48
> > ....
> > > Consider Intel Ice Lake for example, these are the AES-256-XTS encryption speeds
> > > on 4096-byte messages in MB/s I'm seeing:
> > >
> > > xts-aes-aesni              5136
> > > xts-aes-aesni-avx          5366
> > > xts-aes-vaes-avx2          9337
> > > xts-aes-vaes-avx10_256     9876
> > > xts-aes-vaes-avx10_512    10215
> > >
> > > So yes, on that CPU the biggest boost comes just from VAES, staying on AVX2.
> > > But taking advantage of AVX512 does help a bit more, first from the parts other
> > > than 512-bit registers, then a bit more from 512-bit registers.
> >
> > How much does the kernel_fpu_begin() cost on real workloads?
> > (ie when the registers are live and it forces an extra save/restore)
>
> x86 Linux does lazy restore of the FPU state. The first kernel_fpu_begin() can
> have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> causes an XRSTOR (or equivalent) instruction to be issued when returning to
> userspace when it otherwise might not be needed. Additional kernel_fpu_begin()
> / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> as they don't cause any more saves or restores of the FPU state to be done.
>
> My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> pair per message (if the message doesn't span any page boundaries, which is
> almost always the case). That's exactly the same as the current xts-aes-aesni.

I realised after sending it that the existing code almost certainly already
does kernel_fpu_begin(), so there probably isn't a difference because all the
FPU state is always saved.
(I'm sure there should be a way of getting access to (say) 2 ymm registers
by providing an on-stack save area to allow wide data copies or special
instructions - but that is a different issue.)
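
To make the lazy-restore accounting quoted above concrete, here is a toy
userspace model in C. It is only a sketch, not kernel code: struct fpu_model
and the model_*() helpers are names invented for illustration, and the
counters merely stand in for XSAVE/XRSTOR. It assumes the user state is live
in the registers on syscall entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: counts the saves/restores implied by lazy FPU restore. */
struct fpu_model {
	bool user_state_live;	/* user FPU state is still in the registers */
	bool in_kernel_fpu;	/* inside a begin/end section */
	unsigned int saves;	/* stand-in for XSAVE (or equivalent) */
	unsigned int restores;	/* stand-in for XRSTOR (or equivalent) */
};

/* Assumption: user state is live in the registers when the syscall starts. */
static void model_syscall_entry(struct fpu_model *m)
{
	m->user_state_live = true;
}

/* Only the first section after entry has to save the user state. */
static void model_kernel_fpu_begin(struct fpu_model *m)
{
	assert(!m->in_kernel_fpu);	/* sections do not nest in this model */
	if (m->user_state_live) {
		m->saves++;
		m->user_state_live = false;
	}
	m->in_kernel_fpu = true;
}

/* Ending a section does not restore anything; that is deferred. */
static void model_kernel_fpu_end(struct fpu_model *m)
{
	assert(m->in_kernel_fpu);
	m->in_kernel_fpu = false;
}

/* The deferred restore happens once, on the way back to userspace. */
static void model_return_to_user(struct fpu_model *m)
{
	if (!m->user_state_live) {
		m->restores++;
		m->user_state_live = true;
	}
}

/* One syscall containing 'sections' begin/end pairs. */
static struct fpu_model model_one_syscall(unsigned int sections)
{
	struct fpu_model m = { 0 };
	unsigned int i;

	model_syscall_entry(&m);
	for (i = 0; i < sections; i++) {
		model_kernel_fpu_begin(&m);
		model_kernel_fpu_end(&m);
	}
	model_return_to_user(&m);
	return m;
}
```

In this model a syscall with no FPU section costs nothing, and one with any
number of sections costs one save and one restore, which is the behaviour
described in the quoted text.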

> I think what you may really be asking is how much the overhead of the XSAVE /
> XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> That's much more relevant to this patchset.

It depends on what has to be saved, not on what is used.
Although, since all the x/y/zmm registers are caller-saved I think they could
be 'zapped' on syscall entry (and restored as zero later).
Trouble is I suspect there is a single piece of code somewhere that relies
on them being preserved across an inlined system call.

> I think the answer is that there is no additional overhead. This is because the
> XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> and it operates on the userspace state, not the kernel's. Some of the newer
> variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> they don't save parts of the state that are unmodified since the last XRSTOR;
> however, that is unimportant here because the kernel's FPU state is never saved.
>
> (This would change if x86 Linux were to support preemption of kernel-mode FPU
> code. In that case, we may need to take more care to minimize use of AVX and
> AVX512 state. That being said, AES-XTS tends to be used for bulk data anyway.)
>
> This is based on theory, though. I'll do a test to confirm that there's indeed
> no additional overhead. And also, even if there's no additional overhead, what
> the existing overhead actually is.

Yes, I was wondering how it is used for 'real applications'.
If a system call that would normally return immediately (or at least without
a full process switch) hits the aes code it gets the cost of the XSAVE added.
The benchmark, by contrast, probably doesn't incur anywhere near as many XSAVEs.

OTOH this is probably no different.
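
As a rough way to reason about how a fixed per-call cost shows up, the
arithmetic can be sketched in C. The helper effective_mb_per_s() and its
parameters are hypothetical, and overhead_ns is a placeholder to be filled in
from measurement; no real XSAVE cost is assumed here.

```c
#include <assert.h>

/*
 * Hypothetical helper: throughput in MB/s (1 MB = 10^6 bytes) seen by a
 * caller when every call pays a fixed overhead, for example an extra
 * XSAVE/XRSTOR pair, on top of the raw cipher time.
 */
static double effective_mb_per_s(double msg_bytes, double raw_mb_per_s,
				 double overhead_ns)
{
	double cipher_ns;

	assert(msg_bytes > 0 && raw_mb_per_s > 0 && overhead_ns >= 0);

	/* 1 MB/s is 1e-3 bytes per nanosecond. */
	cipher_ns = msg_bytes / (raw_mb_per_s * 1e-3);

	/* bytes per nanosecond, converted back to MB/s */
	return msg_bytes / (cipher_ns + overhead_ns) * 1e3;
}
```

For a given overhead, short messages lose proportionally more than long ones,
so a benchmark on large back-to-back messages would tend to hide a cost that a
short, otherwise cheap system call would see.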

>
> > I've not looked at the code but I often see what looks like
> > excessive inlining in crypto code.
> > This will speed up benchmarks but can have a negative effect
> > on real code both because of the time taken to load the
> > code and the effect of displacing other code.
> >
> > It might be that this code is a simple loop....
>
> This is a different topic. By "inlining" I assume that you also mean things
> like loop unrolling. I totally agree that some of the crypto assembly code goes
> way overboard on this, resulting in an unreasonably large machine code size.
> The AVX implementation of AES-GCM (aesni-intel_avx-x86_64.S), which was written
> by Intel, is the worst offender by far, generating 256011 bytes of machine code.
> In OpenSSL, Intel has even taken that to the next level with their VAES
> optimized implementation of AES-GCM generating 696040 bytes of machine code.

That is truly stunning!
I can't believe anything that big is actually 'optimised'.
Just think of all the TLB misses :-)
Unless it is slightly faster if you are encrypting several TB of data.

..
> So, I think my current proposal is at a reasonable place regarding compiled code
> size, especially when it's compared to the monstrosity that is some of the
> existing crypto assembly code. But let me know if there are any specific
> choices I've made that you may have a different opinion on.

At least you've thought about code size.

David
