[PATCH 0/2] crypto: qce driver fixes for gcm
From: Eneas U de Queiroz
Date:  Mon Feb 03 2020 - 11:54:21 EST
I've finally managed to get gcm(aes) working with the qce crypto engine.
The first patch fixes a bug where the gcm authentication tag was being
overwritten during gcm decryption, because it was passed in the same sgl
buffer as the crypto payload.  The qce driver appends a private state
buffer to the request destination sgl, but it was not checking the
length of the sgl being passed.
The second patch works around a problem whose exact cause I frankly
can't pinpoint, but after some help from Ard Biesheuvel, I think it is
related to DMA.  When gcm sends a request in crypto_gcm_setkey, it
stores the hash (the crypto payload) and the iv in the same data
struct.  When the driver updates the IV, the payload gets overwritten
with the unencrypted data or with all zeroes, which may be a
coincidence.
However, it works if I pass the request down to the fallback driver,
which the driver already uses to accept 192-bit-key requests.  All I
had to do was set up the fallback regardless of key size, and then check
the payload length along with the key size to decide whether to pass the
request to the fallback.  This turns out to enhance performance, because
small requests avoid the latency that comes with using the hardware.
I've started with checking for a single 16-byte AES block, and that is
enough to make gcm work.  The next thing I did was to tune the request
size for performance.  What got me started looking at the qce driver
was reports of it being detrimental to VPN speed, by the way.
I've tested this on an Asus RT-AC58U, but the slow VPN reports[1] have
more devices affected.  Access to the device was kindly provided by
@simsasss.
I've used the openssl speed util to measure the speed, with an AF_ALG
engine I've written to make use of the kernel driver from userspace[2],
running on 4.19.78--I can't run this on a newer kernel yet.
TLDR: In the worst case (where the hardware is slowest), hardware and
software speeds match at around 768 bytes, but I lowered the threshold
to 512 to favor CPU offload.
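As a rough illustration of the dispatch rule described above, here is a
small shell model, not the driver's actual code.  The names pick_path and
SW_MAX_LEN are hypothetical; the only facts taken from this series are
that 192-bit keys go to the fallback and that requests of 512 bytes or
less are also handed to it.

```shell
#!/bin/sh
# Hypothetical model of the dispatch rule; this is NOT the driver code.
# SW_MAX_LEN is an illustrative name for the 512-byte threshold.
SW_MAX_LEN=512

# pick_path KEYBITS LEN: print "fallback" or "qce" for an AES request.
pick_path() {
    keybits=$1
    len=$2
    if [ "$keybits" -ne 128 ] && [ "$keybits" -ne 256 ]; then
        # Key sizes other than 128/256 bits (i.e. 192) use the fallback.
        echo fallback
    elif [ "$len" -le "$SW_MAX_LEN" ]; then
        # Short payloads use the fallback regardless of key size.
        echo fallback
    else
        echo qce
    fi
}

pick_path 128 16     # fallback (single AES block, the gcm setkey case)
pick_path 256 512    # fallback (at the threshold)
pick_path 192 4096   # fallback (192-bit key)
pick_path 128 1024   # qce
```

The first call corresponds to the single 16-byte block that was enough
to make gcm work; the threshold only widens that same check.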
Here's the script I've used:
#!/bin/sh
for len in 256 512 768 1024; do
  echo Block-size: ${len} bytes
  for key in 128 256; do
    for mode in cbc ctr ecb; do
      rmmod qcrypto
      openssl speed -elapsed -evp aes-${key}-${mode} -engine afalg \
                -bytes ${len} 2>&1 \
        | grep ^aes \
        | sed "s/aes-${key}-${mode}     /aes-${key}-${mode} soft/"
      insmod /tmp/qcrypto.ko
      openssl speed -elapsed -evp aes-${key}-${mode} -engine afalg \
                -bytes ${len} 2>&1 \
        | grep ^aes \
        | sed "s/aes-${key}-${mode}     /aes-${key}-${mode} qce /"
    done
  done
done
Here's a sample run--numbers vary from run to run, sometimes greatly:
./test_speed.sh
Block-size: 256 bytes
aes-128-cbc soft  6808.92k
aes-128-cbc qce   2704.10k
aes-128-ctr soft  6785.63k
aes-128-ctr qce   2675.07k
aes-128-ecb soft  7596.86k
aes-128-ecb qce   2772.16k
aes-256-cbc soft  5970.02k
aes-256-cbc qce   2678.84k
aes-256-ctr soft  6164.46k
aes-256-ctr qce   2634.15k
aes-256-ecb soft  6529.03k
aes-256-ecb qce   2720.88k
Block-size: 512 bytes
aes-128-cbc soft  9402.31k
aes-128-cbc qce   5345.69k
aes-128-ctr soft  9766.23k
aes-128-ctr qce   5179.25k
aes-128-ecb soft 10638.85k
aes-128-ecb qce   5437.13k
aes-256-cbc soft  7742.98k
aes-256-cbc qce   5230.08k
aes-256-ctr soft  8174.93k
aes-256-ctr qce   5115.89k
aes-256-ecb soft  8772.61k
aes-256-ecb qce   7282.35k
Block-size: 768 bytes
aes-128-cbc soft 10466.38k
aes-128-cbc qce   7814.59k
aes-128-ctr soft 11161.69k
aes-128-ctr qce   7639.93k
aes-128-ecb soft 12122.37k
aes-128-ecb qce  10764.84k
aes-256-cbc soft  8725.50k
aes-256-cbc qce   9184.41k
aes-256-ctr soft  9233.15k
aes-256-ctr qce   7392.32k
aes-256-ecb soft 10039.30k
aes-256-ecb qce   9148.45k
Block-size: 1024 bytes
aes-128-cbc soft 11418.80k
aes-128-cbc qce  12314.37k
aes-128-ctr soft 11940.86k
aes-128-ctr qce  11982.51k
aes-128-ecb soft 13350.23k
aes-128-ecb qce  10375.28k
aes-256-cbc soft  9003.32k
aes-256-cbc qce  12017.66k
aes-256-ctr soft  9898.89k
aes-256-ctr qce   9672.18k
aes-256-ecb soft 10679.74k
aes-256-ecb qce  12314.37k
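To compare the two columns in a run like the one above, the output can
be reduced to a qce/soft ratio per block size and cipher.  The sketch
below is an assumption about how one might post-process it, not part of
the original test; summarize is a hypothetical helper name, and it
relies on each "soft" line preceding its matching "qce" line, as in the
sample run.  A ratio above 1.00 means the hardware was faster.

```shell
#!/bin/sh
# summarize: read test_speed.sh output on stdin and print
# "<block size> <cipher> <qce/soft ratio>" for every qce line.
summarize() {
    awk '
        /^Block-size:/ { size = $2; next }
        # "6808.92k" + 0 converts the numeric prefix to 6808.92.
        $2 == "soft"   { soft[$1] = $3 + 0; next }
        $2 == "qce" && soft[$1] > 0 {
            printf "%s %s %.2f\n", size, $1, ($3 + 0) / soft[$1]
        }
    '
}

# Two pairs of lines taken from the sample run above.
printf '%s\n' \
    'Block-size: 256 bytes' \
    'aes-128-cbc soft  6808.92k' \
    'aes-128-cbc qce   2704.10k' \
    'Block-size: 1024 bytes' \
    'aes-128-cbc soft 11418.80k' \
    'aes-128-cbc qce  12314.37k' | summarize
# prints:
# 256 aes-128-cbc 0.40
# 1024 aes-128-cbc 1.08
```

A full run could be piped the same way: ./test_speed.sh | summarize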
I imagine that if I were to run the benchmark within the kernel, the
resulting threshold would be even higher, since there's a pretty much
fixed latency from the context switches.  Nonetheless, I think it's
better to let the engine run more, to offload the CPU.
Cheers,
Eneas
[1] https://forum.openwrt.org/t/ipsec-performance-issue/39690
[2] https://github.com/cotequeiroz/afalg_engine
Eneas U de Queiroz (2):
  crypto: qce - use cryptlen when adding extra sgl
  crypto: qce - use AES fallback when len <= 512
 drivers/crypto/qce/dma.c      | 11 ++++++-----
 drivers/crypto/qce/dma.h      |  2 +-
 drivers/crypto/qce/skcipher.c | 17 +++++++----------
 3 files changed, 14 insertions(+), 16 deletions(-)