RE: [PATCH v2 09/19] crypto: x86 - use common macro for FPU limit

From: Elliott, Robert (Servers)
Date: Thu Oct 13 2022 - 17:48:34 EST




> -----Original Message-----
> From: Jason A. Donenfeld <Jason@xxxxxxxxx>
> Sent: Wednesday, October 12, 2022 7:36 PM
> To: Elliott, Robert (Servers) <elliott@xxxxxxx>
> Cc: herbert@xxxxxxxxxxxxxxxxxxx; davem@xxxxxxxxxxxxx;
> tim.c.chen@xxxxxxxxxxxxxxx; ap420073@xxxxxxxxx; ardb@xxxxxxxxxx; linux-
> crypto@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH v2 09/19] crypto: x86 - use common macro for FPU limit
>
> On Wed, Oct 12, 2022 at 04:59:21PM -0500, Robert Elliott wrote:
> > Use a common macro name (FPU_BYTES) for the limit of the number of bytes
> > processed within kernel_fpu_begin and kernel_fpu_end rather than
> > using SZ_4K (which is a signed value), or a magic value of 4096U.
>
> Not sure I like this very much. The whole idea is that this is variable
> per algorithm, since not all algorithms have the same performance
> characteristics.

Good point.

I noticed the powerpc aes, sha1, and sha256 modules include explanations
of how their macros serving a similar purpose were calculated.

arch/powerpc/crypto/aes-spe-glue.c:
* MAX_BYTES defines the number of bytes that are allowed to be processed
* between preempt_disable() and preempt_enable(). e500 cores can issue two
* instructions per clock cycle using one 32/64 bit unit (SU1) and one 32
* bit unit (SU2). One of these can be a memory access that is executed via
* a single load and store unit (LSU). XTS-AES-256 takes ~780 operations per
* 16 byte block or 25 cycles per byte. Thus 768 bytes of input data
* will need an estimated maximum of 20,000 cycles. Headroom for cache misses
* included. Even with the low end model clocked at 667 MHz this equals to a
* critical time window of less than 30us. The value has been chosen to
* process a 512 byte disk block in one or a large 1400 bytes IPsec network
* packet in two runs.
#define MAX_BYTES 768

and arch/powerpc/crypto/sha1-spe-glue.c:
* MAX_BYTES defines the number of bytes that are allowed to be processed
* between preempt_disable() and preempt_enable(). SHA1 takes ~1000
* operations per 64 bytes. e500 cores can issue two arithmetic instructions
* per clock cycle using one 32/64 bit unit (SU1) and one 32 bit unit (SU2).
* Thus 2KB of input data will need an estimated maximum of 18,000 cycles.
* Headroom for cache misses included. Even with the low end model clocked
* at 667 MHz this equals to a critical time window of less than 27us.
#define MAX_BYTES 2048

arch/powerpc/crypto/sha256-spe-glue.c:
* MAX_BYTES defines the number of bytes that are allowed to be processed
* between preempt_disable() and preempt_enable(). SHA256 takes ~2,000
* operations per 64 bytes. e500 cores can issue two arithmetic instructions
* per clock cycle using one 32/64 bit unit (SU1) and one 32 bit unit (SU2).
* Thus 1KB of input data will need an estimated maximum of 18,000 cycles.
* Headroom for cache misses included. Even with the low end model clocked
* at 667 MHz this equals to a critical time window of less than 27us.
#define MAX_BYTES 1024

Perhaps we should declare a time goal like "30 us," measure the actual
speed of each algorithm with a tcrypt speed test, and calculate the
nominal value assuming some slow x86 CPU core speed?

That could be further adjusted at run-time based on the supposed
minimum CPU frequency (e.g., as reported in
/sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq).

If values less than 4 KiB are necessary (e.g., like the powerpc
values), that will require changes in all the modules using
skcipher_walk too.

> So in that sense, it's better to put this close to
> where it's actually used, rather than somewhere at the top of the file.
> When you do that, it makes it seem like "FPU_BYTES" is some universal
> constant, which of course it isn't.
>
> Instead, declare this as an untyped enum value within the function.

Many of these modules use the same value for both an _update and
a _finit function (usually pretty close to each other). Is it
important to avoid replication?