Re: [RFC V1 3/7] crypto: ghash - Optimized GHASH computations

From: Ard Biesheuvel
Date: Sat Dec 19 2020 - 12:07:04 EST

Next message: Roman Gushchin: "Re: [PATCH v2 2/2] memblock: do not start bottom-up allocations with kernel_end"
Previous message: Kalesh Singh: "[PATCH] mm: mremap - Fix extent calculation"
In reply to: Megha Dey: "[RFC V1 3/7] crypto: ghash - Optimized GHASH computations"
Next in thread: Megha Dey: "[RFC V1 4/7] crypto: tcrypt - Add speed test for optimized GHASH computations"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Fri, 18 Dec 2020 at 22:07, Megha Dey <megha.dey@xxxxxxxxx> wrote:
>
> From: Kyung Min Park <kyung.min.park@xxxxxxxxx>
>
> Optimize GHASH computations with the 512-bit wide VPCLMULQDQ instruction.
> The new instruction can process 4 x 16-byte blocks at a time.
> For best parallelism and deeper out-of-order execution, the main loop of
> the code works on 16 x 16-byte blocks at a time and performs a reduction
> every 48 x 16-byte blocks. Such an approach needs 48 precomputed GHASH
> subkeys, and the precompute operation has been optimized as well to
> leverage 512-bit registers, parallel carry-less multiplication, and
> reduction.
>
> The VPCLMULQDQ instruction is used to accelerate the most time-consuming
> part of GHASH, carry-less multiplication. VPCLMULQDQ with AVX-512F adds
> an EVEX-encoded 512-bit version of the PCLMULQDQ instruction.
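
For readers not familiar with the operation: carry-less multiplication is
ordinary shift-and-add multiplication with the additions replaced by XOR,
i.e. polynomial multiplication over GF(2). A minimal portable sketch of the
64 x 64 -> 128 bit product that a single PCLMULQDQ lane computes could look
like the following. The names clmul128 and clmul64 are made up for this
illustration and are not taken from the patch, and the GHASH bit ordering
and the reduction modulo x^128 + x^7 + x^2 + x + 1 are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit carry-less product, split into two 64-bit halves (hypothetical type). */
struct clmul128 {
	uint64_t lo;
	uint64_t hi;
};

/*
 * Bit-by-bit carry-less multiply of two 64-bit polynomials over GF(2).
 * Partial products are combined with XOR instead of ADD, so no carries
 * propagate between bit positions.
 */
struct clmul128 clmul64(uint64_t a, uint64_t b)
{
	struct clmul128 r = { 0, 0 };
	int i;

	for (i = 0; i < 64; i++) {
		if ((b >> i) & 1) {
			r.lo ^= a << i;
			/* Bits shifted out of the low half; skip i == 0 to avoid a 64-bit shift. */
			if (i)
				r.hi ^= a >> (64 - i);
		}
	}
	return r;
}

/* (x + 1) * (x + 1) = x^2 + 1 over GF(2), so 3 * 3 gives 5 rather than 9. */
void clmul64_selftest(void)
{
	struct clmul128 r = clmul64(3, 3);

	assert(r.lo == 5 && r.hi == 0);
}
```

The hardware instruction produces this product in one step per 64-bit lane
pair, which is why widening it to 512 bits lets several blocks be multiplied
at once.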
>
> The glue code in the ghash_clmulni_intel module overrides the existing
> PCLMULQDQ version with the VPCLMULQDQ version when the following criteria
> are met:
> At compile time:
> 1. CONFIG_CRYPTO_AVX512 is enabled
> 2. The toolchain (assembler) supports VPCLMULQDQ instructions
> At runtime:
> 1. VPCLMULQDQ and AVX512VL features are supported on the platform
>    (currently only Icelake)
> 2. If compiled as a built-in module, ghash_clmulni_intel.use_avx512 is set
>    at boot time or /sys/module/ghash_clmulni_intel/parameters/use_avx512 is
>    set to 1 after boot.
>    If compiled as a loadable module, the use_avx512 module parameter must
>    be set:
>    modprobe ghash_clmulni_intel use_avx512=1
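
The runtime half of these criteria boils down to two CPUID feature bits plus
the opt-in parameter. As a rough sketch of that decision, assuming the caller
has already read EBX and ECX from CPUID leaf 7, subleaf 0: the function name
ghash_avx512_usable and the macro names below are invented for illustration.
Kernel code would normally go through the cpufeature helpers rather than raw
register values, and the OS-level enabling of AVX-512 register state is not
covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0) feature bits; macro names are illustrative only. */
#define CPUID7_EBX_AVX512VL	(1u << 31)
#define CPUID7_ECX_VPCLMULQDQ	(1u << 10)

/*
 * Hypothetical helper: returns true when both CPU features from the
 * runtime criteria are present and the user opted in via use_avx512.
 */
bool ghash_avx512_usable(uint32_t ebx, uint32_t ecx, bool use_avx512)
{
	if (!use_avx512)
		return false;
	if (!(ebx & CPUID7_EBX_AVX512VL))
		return false;
	return (ecx & CPUID7_ECX_VPCLMULQDQ) != 0;
}

void ghash_avx512_selftest(void)
{
	/* Both features present, but the parameter is off: stay on PCLMULQDQ. */
	assert(!ghash_avx512_usable(CPUID7_EBX_AVX512VL,
				    CPUID7_ECX_VPCLMULQDQ, false));
	/* Both features present and the parameter is on. */
	assert(ghash_avx512_usable(CPUID7_EBX_AVX512VL,
				   CPUID7_ECX_VPCLMULQDQ, true));
}
```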
>
> With the new implementation, the tcrypt ghash speed test shows about a
> 4x to 10x speedup for GHASH calculation compared to the original
> implementation with PCLMULQDQ when the bytes-per-update size is 256 bytes
> or above. Detailed results for a variety of block sizes and update
> sizes are in the table below. The test was performed on an Icelake-based
> platform with the CPU frequency held constant.
>
> The average performance improvement of the AVX512 version over the current
> implementation is as follows:
> For bytes per update >= 1KB, we see an average improvement of 882% (~8.8x).
> For bytes per update < 1KB, we see an average improvement of 370% (~3.7x).
>
> A typical tcrypt run of the GHASH calculation with the PCLMULQDQ and
> VPCLMULQDQ instructions shows the following results.
>
> ---------------------------------------------------------------------------
> |            |            |         cycles/operation         |            |
> |            |            |      (the lower the better)      |            |
> |    byte    |   bytes    |----------------------------------| percentage |
> |   blocks   | per update |   GHASH test   |   GHASH test    | loss/gain  |
> |            |            | with PCLMULQDQ | with VPCLMULQDQ |            |
> |------------|------------|----------------|-----------------|------------|
> |      16    |      16    |        144     |        233      |    -38.0   |
> |      64    |      16    |        535     |        709      |    -24.5   |
> |      64    |      64    |        210     |        146      |     43.8   |
> |     256    |      16    |       1808     |       1911      |     -5.4   |
> |     256    |      64    |        865     |        581      |     48.9   |
> |     256    |     256    |        682     |        170      |    301.0   |
> |    1024    |      16    |       6746     |       6935      |     -2.7   |
> |    1024    |     256    |       2829     |        714      |    296.0   |
> |    1024    |    1024    |       2543     |        341      |    645.0   |
> |    2048    |      16    |      13219     |      13403      |     -1.3   |
> |    2048    |     256    |       5435     |       1408      |    286.0   |
> |    2048    |    1024    |       5218     |        685      |    661.0   |
> |    2048    |    2048    |       5061     |        565      |    796.0   |
> |    4096    |      16    |      40793     |      27615      |     47.8   |
> |    4096    |     256    |      10662     |       2689      |    297.0   |
> |    4096    |    1024    |      10196     |       1333      |    665.0   |
> |    4096    |    4096    |      10049     |       1011      |    894.0   |
> |    8192    |      16    |      51672     |      54599      |     -5.3   |
> |    8192    |     256    |      21228     |       5284      |    301.0   |
> |    8192    |    1024    |      20306     |       2556      |    694.0   |
> |    8192    |    4096    |      20076     |       2044      |    882.0   |
> |    8192    |    8192    |      20071     |       2017      |    895.0   |
> ---------------------------------------------------------------------------
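
The patch description does not say how the "percentage loss/gain" column is
derived. Judging from the rows, it appears to be the old cycle count divided
by the new one, minus one, as a percentage, with the larger values rounded.
A small sketch under that assumption follows; the function name
ghash_gain_percent is made up for this note.

```c
#include <assert.h>

/*
 * Assumed definition of the "percentage loss/gain" column: positive when
 * the VPCLMULQDQ version needs fewer cycles, negative when it needs more.
 */
double ghash_gain_percent(double cycles_old, double cycles_new)
{
	return (cycles_old / cycles_new - 1.0) * 100.0;
}

void ghash_gain_selftest(void)
{
	double g;

	/* 64 byte blocks, 64 bytes per update: 210 vs 146 cycles, table says 43.8 */
	g = ghash_gain_percent(210.0, 146.0);
	assert(g > 43.7 && g < 43.9);

	/* 64 byte blocks, 16 bytes per update: 535 vs 709 cycles, table says -24.5 */
	g = ghash_gain_percent(535.0, 709.0);
	assert(g > -24.6 && g < -24.4);
}
```

Under this reading, a gain of about 300% means the cycle count dropped to
roughly a quarter, and the negative rows are the cases where the AVX-512
path is slower than the existing code.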
>
> This work was inspired by the AES GCM mode optimization published
> in the Intel Optimized IPSEC Cryptographic library.
> https://github.com/intel/intel-ipsec-mb/lib/avx512/gcm_vaes_avx512.asm
>
> Co-developed-by: Greg Tucker <greg.b.tucker@xxxxxxxxx>
> Signed-off-by: Greg Tucker <greg.b.tucker@xxxxxxxxx>
> Co-developed-by: Tomasz Kantecki <tomasz.kantecki@xxxxxxxxx>
> Signed-off-by: Tomasz Kantecki <tomasz.kantecki@xxxxxxxxx>
> Signed-off-by: Kyung Min Park <kyung.min.park@xxxxxxxxx>
> Co-developed-by: Megha Dey <megha.dey@xxxxxxxxx>
> Signed-off-by: Megha Dey <megha.dey@xxxxxxxxx>

Hello Megha,

What is the purpose of this separate GHASH module? GHASH is only used
in combination with AES-CTR to produce GCM, and this series already
contains a GCM driver.

Do cores exist that implement PCLMULQDQ but not AES-NI?

If not, I think we should be able to drop this patch (and remove the
existing PCLMULQDQ GHASH driver as well).
