Re: [PATCH] crypto: arm64/gcm-ce - unroll factors to 4-way interleave of aes and ghash

From: Ard Biesheuvel
Date: Tue Sep 28 2021 - 17:04:17 EST


On Tue, 28 Sept 2021 at 08:27, Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
>
> On Thu, Sep 23, 2021 at 06:30:25AM +0000, XiaokangQian wrote:
> > To improve performance on cores with deep piplines such as A72,N1,
> > implement gcm(aes) using a 4-way interleave of aes and ghash (totally
> > 8 blocks in parallel), which can make full utilize of pipelines rather
> > than the 4-way interleave we used currently. It can gain about 20% for
> > big data sizes such that 8k.
> >
> > This is a complete new version of the GCM part of the combined GCM/GHASH
> > driver, it will co-exist with the old driver, only serve for big data
> > sizes. Instead of interleaving four invocations of AES where each chunk
> > of 64 bytes is encrypted first and then ghashed, the new version uses a
> > more coarse grained approach where a chunk of 64 bytes is encrypted and
> > at the same time, one chunk of 64 bytes is ghashed (or ghashed and
> > decrypted in the converse case).
> >
> > The table below compares the performance of the old driver and the new
> > one on various micro-architectures and running in various modes with
> > various data sizes.
> >
> > | AES-128 | AES-192 | AES-256 |
> > #bytes | 1024 | 1420 | 8k | 1024 | 1420 | 8k | 1024 | 1420 | 8k |
> > -------+------+------+-----+------+------+-----+------+------+-----+
> > A72 | 5.5% | 12% | 25% | 2.2% | 9.5%| 23%| -1% | 6.7%| 19% |
> > A57 |-0.5% | 9.3%| 32% | -3% | 6.3%| 26%| -6% | 3.3%| 21% |
> > N1 | 0.4% | 7.6%|24.5%| -2% | 5% | 22%| -4% | 2.7%| 20% |
> >
> > Signed-off-by: XiaokangQian <xiaokang.qian@xxxxxxx>
>
> Does this pass the self-tests, including the fuzz tests which are enabled by
> CONFIG_CRYPTO_MANAGER_EXTRA_TESTS=y?
>

Please test both little-endian and big-endian. (Note that you don't
need a big-endian user space for this - the self tests are executed
before the rootfs is mounted)

Also, you will have to rebase this onto the latest cryptodev tree,
which carries some changes I made recently to this driver.

Finally, I'd like to discuss whether we really need two separate
drivers here. The 1k data point is not as relevant as the other ones,
which show a worthwhile speedup for all micro architectures and data
sizes (although I will give this a spin on TX2 myself when I have the
chance)

*If* we switch to this implementation completely, I would like to keep
the improvement I added recently to the decrypt path to compare the
tag using SIMD code, rather than copying it out and using memcmp().
Could you look into adopting this for this version as well?

--
Ard.