RE: [PATCH v2 2/2] random: use BLAKE2s instead of SHA1 in extraction

From: David Laight
Date: Tue Jan 11 2022 - 10:47:06 EST


From: Jason A. Donenfeld
> Sent: 11 January 2022 12:50
>
> On Tue, Jan 11, 2022 at 1:28 PM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> > If you're really quite concerned about m68k code size, I can probably
> > do some things to reduce that. For example, blake2s256_hmac is only
> > used by wireguard and it could probably be made local there. And with
> > some trivial loop re-rolling, I can shave off another 2300 bytes. And
> > I bet I can find a few other things too. The question is: how
> > important is this to you?
>
> And with another trick (see below), another extra 1000 bytes or so
> shaved off. Aside from moving blake2s256_hmac, I'm not really super
> enthusiastic about making these changes, but depending on how important
> this is to you, maybe we can make something work. There are probably
> additional possibilities too with the code.
>
> ====
>
> diff --git a/lib/crypto/blake2s-generic.c b/lib/crypto/blake2s-generic.c
> index 75ccb3e633e6..8e3c6372363a 100644
> --- a/lib/crypto/blake2s-generic.c
> +++ b/lib/crypto/blake2s-generic.c
> @@ -46,7 +46,7 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
> {
> u32 m[16];
> u32 v[16];
> - int i;
> + int i, j;

Use unsigned int i, j;
That ensures the '% 4' operations are done as '& 3' and the divides as shifts.
Unless the compiler already manages to track the valid value ranges, it will
even generate better code on x86-64.
(Saves a sign extension prior to the array indexing.)
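
For illustration only, a rough sketch of what I mean (untested; G(), v[],
i and j are exactly as in the re-rolled loop further down the diff).
With unsigned indices the compiler can emit the '& 3' / '>> 2' forms itself:

	unsigned int i, j;	/* unsigned: no sign extension before indexing */

	for (i = 0; i < 10; ++i)
		for (j = 0; j < 8; ++j)
			/* j % 4 becomes j & 3, j / 4 becomes j >> 2 */
			G(i, j, v[j % 4],
				v[((j + (j / 4)) % 4) + 4],
				v[((j + 2 * (j / 4)) % 4) + 8],
				v[((j + 3 * (j / 4)) % 4) + 12]);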

> WARN_ON(IS_ENABLED(DEBUG) &&
> (nblocks > 1 && inc != BLAKE2S_BLOCK_SIZE));
> @@ -76,29 +76,11 @@ void blake2s_compress_generic(struct blake2s_state *state, const u8 *block,
> b = ror32(b ^ c, 7); \
> } while (0)
>
> -#define ROUND(r) do { \
> - G(r, 0, v[0], v[ 4], v[ 8], v[12]); \
> - G(r, 1, v[1], v[ 5], v[ 9], v[13]); \
> - G(r, 2, v[2], v[ 6], v[10], v[14]); \
> - G(r, 3, v[3], v[ 7], v[11], v[15]); \
> - G(r, 4, v[0], v[ 5], v[10], v[15]); \
> - G(r, 5, v[1], v[ 6], v[11], v[12]); \
> - G(r, 6, v[2], v[ 7], v[ 8], v[13]); \
> - G(r, 7, v[3], v[ 4], v[ 9], v[14]); \
> -} while (0)
> - ROUND(0);
> - ROUND(1);
> - ROUND(2);
> - ROUND(3);
> - ROUND(4);
> - ROUND(5);
> - ROUND(6);
> - ROUND(7);
> - ROUND(8);
> - ROUND(9);
> -
> + for (i = 0; i < 10; ++i) {
> + for (j = 0; j < 8; ++j)
> + G(i, j, v[j % 4], v[((j + (j / 4)) % 4) + 4], v[((j + 2 * (j / 4)) % 4) + 8], v[((j + 3 * (j / 4)) % 4) + 12]);

I think I'd look at doing [0..3] then [4..7] as two separate loops to save execution time.
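
Untested, but roughly this (same G() and v[] as in the patch above; I've
checked the offsets against the old ROUND() macro by hand, but treat it
as a sketch rather than a drop-in replacement):

	for (i = 0; i < 10; ++i) {
		/* column step: the old G(r, 0..3, ...) calls;
		 * j / 4 == 0, so the index arithmetic collapses */
		for (j = 0; j < 4; ++j)
			G(i, j, v[j], v[j + 4], v[j + 8], v[j + 12]);
		/* diagonal step: the old G(r, 4..7, ...) calls */
		for (j = 0; j < 4; ++j)
			G(i, j + 4, v[j],
				    v[((j + 1) & 3) + 4],
				    v[((j + 2) & 3) + 8],
				    v[((j + 3) & 3) + 12]);
	}

That keeps the loops rolled but gets the j / 4 terms out of the hot path.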

I actually wonder how large a block you need to be processing to get
a gain from all that unrolling on architectures like x86-64.
The 'cold cache' timing must be horrid.
Never mind the side effects of displacing that much other code from the I-cache.

There aren't enough registers to hold all the v[] values, so they'll
all be memory reads - and there are probably other spills as well.
The other instructions will execute in parallel - even 3 or 4 of them
for each memory read.

David
