Re: [PATCH 0/6] crypto: SHA512 multibuffer implementation
From: Herbert Xu
Date: Tue Jun 28 2016 - 04:37:27 EST
On Mon, Jun 27, 2016 at 10:20:03AM -0700, Megha Dey wrote:
> From: Megha Dey <megha.dey@xxxxxxxxxxxxxxx>
>
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA512 hash computation. The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This allows us to parallelize computations when data inter-dependency in
> a single crypto job prevents us from fully parallelizing our computations.
> The algorithm can be extended to other hashing and encryption schemes
> in the future.
>
> On multi-buffer SHA512 computation with AVX2, we see a throughput increase
> of up to 2x over the existing x86_64 single buffer AVX2 algorithm.
>
> The multi-buffer crypto algorithm is described in the following paper:
> Processing Multiple Buffers in Parallel to Increase Performance on
> Intel® Architecture Processors
> http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
>
> The outline of the algorithm is sketched below:
> Any driver requesting the crypto service will place an async
> crypto request on the workqueue. The multi-buffer crypto daemon will
> pull requests from the workqueue and put each request in an empty data lane
> for multi-buffer crypto computation. When all the empty lanes are filled,
> computation will commence on the jobs in parallel and the job with the
> shortest remaining buffer will get completed and be returned. To prevent
> a prolonged stall when there are no new jobs arriving, we will flush a crypto
> job if it has not been completed after a maximum allowable delay.
>
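
The lane-filling and flush behaviour outlined above can be sketched in
plain C. This is only an illustration under stated assumptions, not the
code from the patch series: the names mb_job, mb_mgr, mb_submit, mb_run
and mb_flush are hypothetical, MB_LANES is set to 4 purely to keep the
example short, and subtracting the shortest remaining length stands in
for the actual SIMD hash rounds.

```c
#include <assert.h>
#include <stddef.h>

#define MB_LANES 4	/* hypothetical lane count, chosen for brevity */

struct mb_job {
	size_t remaining;	/* bytes still to be hashed */
	int done;		/* set once the job has completed */
};

struct mb_mgr {
	struct mb_job *lane[MB_LANES];	/* NULL marks an empty lane */
	int used;			/* number of occupied lanes */
};

/*
 * Advance every occupied lane by the shortest remaining length and
 * return one job that reached zero. Returns NULL if all lanes are empty.
 */
static struct mb_job *mb_run(struct mb_mgr *m)
{
	struct mb_job *job;
	size_t min = 0;
	int i, first = -1;

	for (i = 0; i < MB_LANES; i++) {
		if (!m->lane[i])
			continue;
		if (first < 0 || m->lane[i]->remaining < min) {
			min = m->lane[i]->remaining;
			first = i;
		}
	}
	if (first < 0)
		return NULL;

	/* Stand-in for the parallel hash rounds over all occupied lanes. */
	for (i = 0; i < MB_LANES; i++)
		if (m->lane[i])
			m->lane[i]->remaining -= min;

	job = m->lane[first];
	job->done = 1;
	m->lane[first] = NULL;	/* the lane is free for a new job */
	m->used--;
	return job;
}

/* Place a job in an empty lane; compute only once every lane is full. */
static struct mb_job *mb_submit(struct mb_mgr *m, struct mb_job *job)
{
	int i;

	assert(m->used < MB_LANES);
	for (i = 0; i < MB_LANES; i++) {
		if (!m->lane[i]) {
			m->lane[i] = job;
			m->used++;
			break;
		}
	}
	return m->used == MB_LANES ? mb_run(m) : NULL;
}

/* Called when the maximum allowable delay expires with lanes still empty. */
static struct mb_job *mb_flush(struct mb_mgr *m)
{
	return mb_run(m);
}
```

In this sketch mb_submit returns NULL until the last lane is filled, at
which point the shortest job comes back; mb_flush forces progress with
partially filled lanes, which is where the parallel speedup is lost.
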
> The multi-buffer algorithm necessitates mapping multiple scatter gather
> buffers to linear addresses simultaneously. The crypto daemon may need
> to sleep and yield the cpu to work on something else from time to time.
> Since a kmap_atomic mapping cannot be held across a sleep, we made a
> change to not use kmap_atomic to do scatter-gather buffer mapping and
> instead take advantage of the fact that we can directly translate
> the buffer's address to its linear address on x86_64.
> To accommodate the fragmented nature of scatter-gather, we will keep
> submitting the next scatter-buffer fragment for a job for multi-buffer
> computation until a job is completed and no more buffer fragments remain.
> At that time we will pull a new job to fill the now empty data slot.
> We call a get_completed_job function to check whether there are other
> jobs that have been completed when we have no new job arrival,
> to prevent extraneous delay in returning any completed jobs.
>
> The multi-buffer algorithm should be used for cases where crypto job
> submissions arrive at a reasonably high rate. For a low crypto job
> submission rate, this algorithm will not be beneficial. The reason is
> that at a low rate we do not fill the data lanes before the maximum
> allowable latency expires, so we will be flushing the jobs instead of
> processing them with all the data lanes full. We will miss the benefit
> of parallel computation and add delay to the processing of the crypto
> job at the same time.
> Some tuning of the maximum latency parameter may be needed to get the
> best performance.
>
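
The trade-off between submission rate and the maximum latency can be
made concrete with a rough model. The sketch below assumes evenly spaced
job arrivals, which real workloads will not follow exactly; the function
name lanes_filled and its parameters are hypothetical and not part of
the patch series. It estimates how many lanes are occupied when the
flush delay expires: the first job plus however many more arrive within
the delay, capped at the lane count.

```c
#include <assert.h>

/*
 * Estimated number of occupied lanes when computation starts,
 * assuming one job arrives every interarrival_us microseconds and the
 * first job is flushed after max_delay_us microseconds at the latest.
 */
static unsigned int lanes_filled(unsigned int nlanes,
				 unsigned int interarrival_us,
				 unsigned int max_delay_us)
{
	unsigned int n;

	assert(nlanes > 0 && interarrival_us > 0);
	n = 1 + max_delay_us / interarrival_us;
	return n < nlanes ? n : nlanes;
}
```

With four lanes and a 100 us delay (illustrative numbers only), a job
every 10 us fills all four lanes, a job every 80 us fills two, and a job
every 500 us leaves a single lane occupied, so each job pays the full
delay with no parallel gain.
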
> Also added is a new mode in the tcrypt module to calculate the speed of the
> sha512_mb algorithm.
All applied. Thanks.
--
Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt