Do you have a plan for how this will be merged? Which trees will the patches go
through? I think that the actual AES-XTS implementation could still use a bit
more polishing; see my comments below. However, patches 1-12 don't need to wait
for that. Perhaps the x86 maintainers would like to take patches 1-12 for
v6.10? Then the AES-XTS support can go through the crypto tree afterwards.
As you noticed, this cycle I've been optimizing AES-XTS for x86_64 by adding new
VAES and AES-NI + AVX implementations. I have some ideas for the Key Locker
based implementation of AES-XTS:
First, surely it's the case that in practice, all CPUs that support Key Locker
also support AVX? If so, then there's no need for the Key Locker assembly to
use legacy SSE instructions. It should instead target AVX and use VEX-coded
instructions. This would save some instructions and improve performance.
Since the Key Locker assembly only supports 64-bit mode, it should also feel
free to use registers xmm8-xmm15 for purposes such as caching the XTS tweaks.
This would improve performance.
Since the Key Locker assembly advances a large number of XTS tweaks at a time
(8), I'm also wondering if it would be faster to multiply by x^8 directly
instead of multiplying by x sequentially eight times. This can be done using
the pclmulqdq instruction; see aes-xts-avx-x86_64.S which implements this
optimization. Probably all CPUs that support Key Locker also support PCLMULQDQ.
I'm also trying to think of the best way to organize the Key Locker AES-XTS glue
code. I see that you're proposing to share the glue code with the existing
AES-XTS implementations. Unfortunately I don't think this ends up working very
well, due to the facts that the Key Locker code can return errors and uses a
different key type. I think that for now, I'd prefer that you simply copied the
XTS glue code into aeskl-intel_glue.c and modified it as needed. (But make sure
to use the new version of the glue code, which is faster.)
For falling back to AES-NI, I think the cleanest solution is to call the
top-level setkey, encrypt, and decrypt functions (the ones that are set in the
xts-aes-aesni skcipher_alg), instead of calling lower-level functions as your
current patchset does.
If you could keep the Key Locker assembly roughly stylistically consistent with
the new aes-xts-avx-x86_64.S, that would be great too.
Do you happen to know if there's any way to test the Key Locker AES-XTS code
without having access to a bare metal machine with a CPU that supports Key
Locker? I tried a Sapphire Rapids based VM in Google Compute Engine, but it
doesn't enumerate Key Locker. I also don't see anything in QEMU related to Key
Locker. So I don't currently have an easy way to test this patchset.
Finally, a high level question. Key Locker has been reported to be
substantially slower than AES-NI. At the same time, VAES has recently doubled
performance over AES-NI. I'd guess this leaves Key Locker even further behind.
Given that, how useful is this patchset? I'm a bit concerned that this might be
something that sounds good in theory but won't be used in practice. Are
performance improvements for Key Locker on the horizon? (Well, there are the
improvements I suggested above, which should help, but it sounds like main issue
is the Key Locker instructions themselves which are just fundamentally slower.)