[PATCH 0/2] crypto: qce driver fixes for gcm

From: Eneas U de Queiroz
Date: Mon Feb 03 2020 - 11:54:21 EST


I've finally managed to get gcm(aes) working with the qce crypto engine.

These first patch fixes a bug where the gcm authentication tag was being
overwritten during gcm decryption, because it was passed in the same sgl
buffer as the crypto payload. The qce driver appends some private state
buffer to the request destination sgl, but it was not checking the
length of the sgl being passed.

The second patch works around a problem, which I frankly can't pinpoint
what exactly is the cause, but after some help from Ard Biesheuvel, I
think it is related to DMA. When gcm sends a request in
crypto_gcm_setkey, it stores the hash (the crypto payload) and the iv in
the same data struct. When the drivers updates the IV, then the payload
gets overwritten with the unencrypted data, or all zeroes, it may be a
coincidence.

However, it works if I pass the request down to the fallback driver--it
is used by the driver to accept 192-bit-key requests. All I had to do
was setup the fallback regardless of key size, and then check the
payload length along with the keysize to pass the request to the
fallback. This turns out to enhance performance, because of the
avoided latency that comes with using the hardware.

I've started with checking for a single 16-byte AES block, and that is
enough to make gcm work. Next thing I've done was to tune the request
size for performance. What got me started into looking at the qce
driver was reports of it being detrimental to VPN speed, by the way.
I've tested this win an Asus RT-AC58U, but the slow VPN reports[1] have
more devices affected. Access to the device was kindly provided by
@simsasss.

I've used the openssl speed util to measure the speed, with an AF_ALG
engine I've written to make use of the kernel driver from userspace[2],
running on 4.19.78--I can't run this on a newer kernel yet.

TLDR: In the worst (where the hardware is slowest) case, hardware and
software speed match at aroung 768 bytes, but I lowered the threshold to
512 to benefit the CPU offload.

Here's the script I've used:
#!/bin/sh
for len in 256 512 768 1024; do
echo Block-size: ${len} bytes
for key in 128 256; do
for mode in cbc ctr ecb; do
rmmod qcrypto
openssl speed -elapsed -evp aes-${key}-${mode} -engine afalg \
-bytes ${len} 2>&1 \
| grep ^aes \
| sed "s/aes-${key}-${mode} /aes-${key}-${mode} soft/"
insmod /tmp/qcrypto.ko
openssl speed -elapsed -evp aes-${key}-${mode} -engine afalg \
-bytes ${len} 2>&1 \
| grep ^aes \
| sed "s/aes-${key}-${mode} /aes-${key}-${mode} qce /"
done
done
done

Here's a sample run--numbers vary from run to run, sometimes greatly:

./test_speed.sh
Block-size: 256 bytes
aes-128-cbc soft 6808.92k
aes-128-cbc qce 2704.10k
aes-128-ctr soft 6785.63k
aes-128-ctr qce 2675.07k
aes-128-ecb soft 7596.86k
aes-128-ecb qce 2772.16k
aes-256-cbc soft 5970.02k
aes-256-cbc qce 2678.84k
aes-256-ctr soft 6164.46k
aes-256-ctr qce 2634.15k
aes-256-ecb soft 6529.03k
aes-256-ecb qce 2720.88k
Block-size: 512 bytes
aes-128-cbc soft 9402.31k
aes-128-cbc qce 5345.69k
aes-128-ctr soft 9766.23k
aes-128-ctr qce 5179.25k
aes-128-ecb soft 10638.85k
aes-128-ecb qce 5437.13k
aes-256-cbc soft 7742.98k
aes-256-cbc qce 5230.08k
aes-256-ctr soft 8174.93k
aes-256-ctr qce 5115.89k
aes-256-ecb soft 8772.61k
aes-256-ecb qce 7282.35k
Block-size: 768 bytes
aes-128-cbc soft 10466.38k
aes-128-cbc qce 7814.59k
aes-128-ctr soft 11161.69k
aes-128-ctr qce 7639.93k
aes-128-ecb soft 12122.37k
aes-128-ecb qce 10764.84k
aes-256-cbc soft 8725.50k
aes-256-cbc qce 9184.41k
aes-256-ctr soft 9233.15k
aes-256-ctr qce 7392.32k
aes-256-ecb soft 10039.30k
aes-256-ecb qce 9148.45k
Block-size: 1024 bytes
aes-128-cbc soft 11418.80k
aes-128-cbc qce 12314.37k
aes-128-ctr soft 11940.86k
aes-128-ctr qce 11982.51k
aes-128-ecb soft 13350.23k
aes-128-ecb qce 10375.28k
aes-256-cbc soft 9003.32k
aes-256-cbc qce 12017.66k
aes-256-ctr soft 9898.89k
aes-256-ctr qce 9672.18k
aes-256-ecb soft 10679.74k
aes-256-ecb qce 12314.37k

I imagine that if I were to run the benchmark within the kernel, the
resulting threshould would be eve higher, since there's a pretty much
fixed latency from the context switches. Nonetheless, I think it's
better to let the engine run more, to offload the CPU.

Cheers,

Eneas

[1] https://forum.openwrt.org/t/ipsec-performance-issue/39690
[2] https://github.com/cotequeiroz/afalg_engine

Eneas U de Queiroz (2):
crypto: qce - use cryptlen when adding extra sgl
crypto: qce - use AES fallback when len <= 512

drivers/crypto/qce/dma.c | 11 ++++++-----
drivers/crypto/qce/dma.h | 2 +-
drivers/crypto/qce/skcipher.c | 17 +++++++----------
3 files changed, 14 insertions(+), 16 deletions(-)