Re: [PATCH 0/6] Faster AES-XTS on modern x86_64 CPUs
From: Eric Biggers
Date: Fri Apr 05 2024 - 15:19:13 EST

On Thu, Apr 04, 2024 at 07:53:48AM +0000, David Laight wrote:
> > >
> > > How much does the kernel_fpu_begin() cost on real workloads?
> > > (ie when the registers are live and it forces an extra save/restore)
> >
> > x86 Linux does lazy restore of the FPU state. The first kernel_fpu_begin() can
> > have a significant cost, as it issues an XSAVE (or equivalent) instruction and
> > causes an XRSTOR (or equivalent) instruction to be issued when returning to
> > userspace when it otherwise might not be needed. Additional kernel_fpu_begin()
> > / kernel_fpu_end() pairs without returning to userspace have only a small cost,
> > as they don't cause any more saves or restores of the FPU state to be done.
> >
> > My new xts(aes) implementations have one kernel_fpu_begin() / kernel_fpu_end()
> > pair per message (if the message doesn't span any page boundaries, which is
> > almost always the case). That's exactly the same as the current xts-aes-aesni.
>
> I realised after sending it that the code almost certainly already did
> kernel_fpu_begin() - so there probably isn't a difference because all the
> fpu state is always saved.
> (I'm sure there should be a way of getting access to (say) 2 ymm registers
> by providing an on-stack save area to allow wide data copies or special
> instructions - but that is a different issue.)
>
> > I think what you may really be asking is how much the overhead of the XSAVE /
> > XRSTOR pair associated with kernel-mode use of the FPU *increases* if the kernel
> > clobbers AVX or AVX512 state, instead of just SSE state as xts-aes-aesni does.
> > That's much more relevant to this patchset.
>
> It depends on what has to be saved, not on what is used.
> Although, since all the x/y/zmm registers are caller-saved I think they could
> be 'zapped' on syscall entry (and restored as zero later).
> Trouble is I suspect there is a single piece of code somewhere that relies
> on them being preserved across an inlined system call.
>
> > I think the answer is that there is no additional overhead. This is because the
> > XSAVE / XRSTOR pair happens regardless of the type of state the kernel clobbers,
> > and it operates on the userspace state, not the kernel's. Some of the newer
> > variants of XSAVE (XSAVEOPT and XSAVES) do have a "modified" optimization where
> > they don't save parts of the state that are unmodified since the last XRSTOR;
> > however, that is unimportant here because the kernel's FPU state is never saved.
> >
> > (This would change if x86 Linux were to support preemption of kernel-mode FPU
> > code. In that case, we may need to take more care to minimize use of AVX and
> > AVX512 state. That being said, AES-XTS tends to be used for bulk data anyway.)
> >
> > This is based on theory, though. I'll do a test to confirm that there's indeed
> > no additional overhead. And also, even if there's no additional overhead, what
> > the existing overhead actually is.
>
> Yes, I was wondering how it is used for 'real applications'.
> If a system call that would normally return immediately (or at least without
> a full process switch) hits the aes code it gets the cost of the XSAVE added.
> Whereas the benchmark probably doesn't do anywhere near as many.
>
> OTOH this is probably no different.

I did some tests on Sapphire Rapids using a system call that I customized to do
nothing except possibly a kernel_fpu_begin / kernel_fpu_end pair.

On average the bare syscall took 70 ns. The syscall with the kernel_fpu_begin /
kernel_fpu_end pair took 160 ns if the userspace program used xmm only, 340 ns
if it used ymm, or 360 ns if it used zmm. I also tried making the kernel
clobber different registers in the kernel_fpu_begin / kernel_fpu_end section,
and as I expected this did not make any difference.
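
Subtracting the bare syscall time from each figure gives a rough estimate of
what the kernel_fpu_begin / kernel_fpu_end pair itself adds, including the
restore on return to userspace. The sketch below only redoes that arithmetic
on the numbers quoted above. The variable and function names are made up for
illustration, and treating the costs as simply additive is an assumption, not
something that was measured separately.

```python
# Averages quoted above (Sapphire Rapids), in nanoseconds.
BARE_SYSCALL_NS = 70
WITH_FPU_PAIR_NS = {"xmm": 160, "ymm": 340, "zmm": 360}


def fpu_pair_overhead_ns(total_ns, bare_ns=BARE_SYSCALL_NS):
    """Time attributed to the begin/end pair, assuming costs are additive."""
    return total_ns - bare_ns


# Keyed by the widest register class the userspace program had used.
overhead = {regs: fpu_pair_overhead_ns(ns) for regs, ns in WITH_FPU_PAIR_NS.items()}

for regs, ns in overhead.items():
    print(f"userspace used {regs}: about {ns} ns added by the begin/end pair")
```

That works out to about 90 ns (xmm), 270 ns (ymm), and 290 ns (zmm). The cost
tracks the userspace state that has to be saved and restored, which matches
the observation that the registers clobbered by the kernel made no difference.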

Note that without the kernel_fpu_begin / kernel_fpu_end pair, AES-NI
instructions cannot be used and the alternative would be xts(ecb(aes-generic)).
On the same CPU, encrypting a single 512-byte sector with xts(ecb(aes-generic))
takes about 2235 ns. With xts-aes-vaes-avx10_512 it takes 75 ns. (Not a typo --
it really is almost 30 times faster!) So it seems clear the FPU state save and
restore is worth it even just for a single sector using the traditional 512-byte
sector size, let alone a 4096-byte sector size which is recommended these days.
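
As a back-of-the-envelope check of that conclusion, the sketch below charges
the worst FPU overhead seen in the syscall test (zmm case minus bare syscall)
on top of the 512-byte VAES time and compares the result with aes-generic.
This is only arithmetic on the figures above, not a new measurement. The names
are illustrative, and adding the two numbers together is a deliberately
pessimistic assumption.

```python
# Figures quoted above (Sapphire Rapids), in nanoseconds.
BARE_SYSCALL_NS = 70
SYSCALL_WITH_FPU_ZMM_NS = 360   # worst case: userspace had used zmm
GENERIC_512B_NS = 2235          # xts(ecb(aes-generic)), one 512-byte sector
VAES_512B_NS = 75               # xts-aes-vaes-avx10_512, one 512-byte sector

# Raw speedup of the VAES implementation on one sector.
speedup = GENERIC_512B_NS / VAES_512B_NS

# Pessimistic estimate: pay the full worst-case FPU overhead for one sector.
worst_fpu_overhead_ns = SYSCALL_WITH_FPU_ZMM_NS - BARE_SYSCALL_NS
vaes_with_overhead_ns = VAES_512B_NS + worst_fpu_overhead_ns
speedup_with_overhead = GENERIC_512B_NS / vaes_with_overhead_ns

print(f"raw speedup: {speedup:.1f}x")
print(f"VAES plus worst-case FPU overhead: {vaes_with_overhead_ns} ns")
print(f"speedup with overhead charged: {speedup_with_overhead:.1f}x")
```

Even with the full 290 ns charged to a single 512-byte sector, the estimate is
365 ns against 2235 ns, so the vectorized path still comes out roughly 6 times
faster under these assumptions.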

- Eric