Re: [PATCH 0/7] crypto: SHA256 multibuffer implementation
From: Herbert Xu
Date: Mon Jun 27 2016 - 05:05:10 EST
On Thu, Jun 23, 2016 at 06:40:41PM -0700, Megha Dey wrote:
> From: Megha Dey <megha.dey@xxxxxxxxxxxxxxx>
>
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA256 hash computation. The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This allows us to parallelize computations when data inter-dependency in
> a single crypto job prevents us from fully parallelizing our computations.
> The algorithm can be extended to other hashing and encryption schemes
> in the future.
>
> On multi-buffer SHA256 computation with AVX2, we see a throughput increase
> of up to 2.2x over the existing x86_64 single buffer AVX2 algorithm.
>
> The multi-buffer crypto algorithm is described in the following paper:
> Processing Multiple Buffers in Parallel to Increase Performance on
> Intel® Architecture Processors
> http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html
>
> The outline of the algorithm is sketched below:
> Any driver requesting the crypto service will place an async
> crypto request on the workqueue. The multi-buffer crypto daemon will
> pull requests from the work queue and put each request in an empty data lane
> for multi-buffer crypto computation. When all the empty lanes are filled,
> computation will commence on the jobs in parallel and the job with the
> shortest remaining buffer will get completed and be returned. To prevent
> a prolonged stall when there are no new jobs arriving, we will flush a crypto
> job if it has not been completed after a maximum allowable delay.
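
To make the outline above concrete, here is a minimal sketch in Python of
the lane-filling behaviour it describes. It is a toy model, not the kernel
code: run_lanes, the block-count unit and the dictionary of lanes are all
hypothetical names chosen for illustration, and the final drain of
partially filled lanes only stands in for the flush on maximum delay.

```python
from collections import deque

NUM_LANES = 8  # the AVX2 case in the cover letter: 8 data lanes


def run_lanes(job_lengths, num_lanes=NUM_LANES):
    """Toy model: return job indices in the order the jobs complete.

    job_lengths[i] is the size of job i in blocks (hypothetical unit).
    """
    pending = deque(enumerate(job_lengths))  # stands in for the work queue
    lanes = {}                               # lane number -> [job id, blocks left]
    free = list(range(num_lanes))            # lane numbers that are empty
    completed = []

    while pending or lanes:
        # Put each pending request into an empty data lane.
        while pending and free:
            job_id, length = pending.popleft()
            lanes[free.pop()] = [job_id, length]

        # All occupied lanes advance in lock step, for as many blocks as
        # the job with the shortest remaining buffer still needs.
        step = min(remaining for _, remaining in lanes.values())
        for lane in list(lanes):
            lanes[lane][1] -= step
            if lanes[lane][1] == 0:
                # Job finished: return it and free its lane for a new job.
                completed.append(lanes[lane][0])
                del lanes[lane]
                free.append(lane)
    return completed
```

For example, run_lanes([5, 3, 8], num_lanes=2) completes the 3-block job
first, refills its lane with the 8-block job, and then completes the
5-block job before the 8-block one.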
>
> To accommodate the fragmented nature of scatter-gather, we will keep
> submitting the next scatter-buffer fragment for a job for multi-buffer
> computation until a job is completed and no more buffer fragments remain.
> At that time we will pull a new job to fill the now empty data slot.
> We call a get_completed_job function to check whether other jobs have
> already completed when no new job has arrived, so that completed jobs
> are returned without extraneous delay.
>
> The multi-buffer algorithm should be used for cases where crypto job
> submissions arrive at a reasonably high rate. For a low crypto job
> submission rate, this algorithm will not be beneficial. The reason is
> that at a low rate we do not fill the data lanes before the maximum
> allowable latency expires, so we end up flushing the jobs instead of
> processing them with all the data lanes full. We miss the benefit of
> parallel computation and add delay to the processing of each crypto job
> at the same time.
> Some tuning of the maximum latency parameter may be needed to get the
> best performance.
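
The effect of the submission rate on lane occupancy can be illustrated
with a small sketch in Python. It is deliberately simplified and is not how
the patch series schedules work: it assumes jobs are dispatched as whole
batches, either when all lanes are full or when the oldest waiting job has
exceeded the maximum delay. batch_occupancy, its parameters and the time
unit are hypothetical.

```python
def batch_occupancy(arrival_times, num_lanes=8, max_delay=10):
    """Toy model: return how many lanes were occupied in each batch.

    arrival_times are job submission times in an arbitrary unit;
    max_delay is the longest a job may wait before a flush.
    """
    batches = []
    waiting = []  # arrival times of jobs currently sitting in lanes

    for t in sorted(arrival_times):
        # The oldest waiting job timed out before this arrival: flush
        # whatever is in the lanes, even though they are not all full.
        if waiting and t - waiting[0] > max_delay:
            batches.append(len(waiting))
            waiting = []

        waiting.append(t)

        # All lanes are filled: compute on the full set in parallel.
        if len(waiting) == num_lanes:
            batches.append(len(waiting))
            waiting = []

    if waiting:
        batches.append(len(waiting))  # final flush of the leftovers
    return batches
```

With one arrival per time unit, batch_occupancy(range(16)) gives two
batches of 8 occupied lanes. With arrivals spaced 20 units apart,
batch_occupancy([0, 20, 40, 60]) gives four batches with a single lane
occupied each, so every job pays the waiting delay and gains nothing from
the parallel lanes.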
>
> Note that in the tcrypt SHA256 speed test, we wait for a previous job to
> be completed before submitting a new job. Hence this is not a valid
> test for the multi-buffer algorithm, as it requires multiple outstanding
> jobs submitted to fill all the data lanes to be effective (i.e. 8
> outstanding jobs for the AVX2 case). An updated version of the tcrypt
> test is also included, which contains a more appropriate test for this
> scenario.
>
> As this is the first algorithm in the kernel's crypto library
> for which we have tried to use multi-buffer optimizations, feedback
> and testing will be much appreciated.
All applied. Thanks.
--
Email: Herbert Xu <herbert@xxxxxxxxxxxxxxxxxxx>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt