Re: [PATCH crypto-stable] crypto: arch/lib - limit simd usage to PAGE_SIZE chunks

From: Ard Biesheuvel
Date: Thu Apr 23 2020 - 04:46:14 EST


On Wed, 22 Apr 2020 at 22:17, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> On Wed, Apr 22, 2020 at 1:51 PM Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 22, 2020 at 1:39 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 22 Apr 2020 at 09:32, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
> > > >
> > > > On Tue, Apr 21, 2020 at 10:04 PM Eric Biggers <ebiggers@xxxxxxxxxx> wrote:
> > > > > Seems this should just be a 'while' loop?
> > > > >
> > > > > while (bytes) {
> > > > > unsigned int todo = min_t(unsigned int, PAGE_SIZE, bytes);
> > > > >
> > > > > kernel_neon_begin();
> > > > > chacha_doneon(state, dst, src, todo, nrounds);
> > > > > kernel_neon_end();
> > > > >
> > > > > bytes -= todo;
> > > > > src += todo;
> > > > > dst += todo;
> > > > > }
> > > >
> > > > The for(;;) is how it's done elsewhere in the kernel (that this patch
> > > > doesn't touch), because then we can break out of the loop before
> > > > having to increment src and dst unnecessarily. Likely a pointless
> > > > optimization as probably the compiler can figure out how to avoid
> > > > that. But maybe it can't. If you have a strong preference, I can
> > > > reactor everything to use `while (bytes)`, but if you don't care,
> > > > let's keep this as-is. Opinion?
> > > >
> > >
> > > Since we're bikeshedding, I'd prefer 'do { } while (bytes);' here,
> > > given that bytes is guaranteed to be non-zero before we enter the
> > > loop. But in any case, I'd prefer avoiding for(;;) or while(1) where
> > > we can.
> >
> > Okay, will do-while it up for v2.
>
> I just sent v2 containing do-while, and I'm fine with that going in
> that way. But just in the interest of curiosity in the pan-tone
> palette, check this out:
>
> https://godbolt.org/z/VxXien
>
> It looks like on mine, the compiler avoids unnecessarily calling those
> adds on the last iteration, but on the other hand, it results in an
> otherwise unnecessary unconditional jump for the < 4096 case. Sort of
> interesting. Arm64 code is more or less the same difference too.

Yeah, even if shaving off 1 or 2 cycles mattered here (since we've
just decided that ugh() may take up to 20,000 cycles), hiding a couple
of ALU instructions in the slots between the subs (which sets the zero
flag) and the conditional branch that tests it probably comes for free
on in-order cores anyway. And even if it didn't, backwards branches
are usually statically predicted as taken, in which case their results
are actually needed.

On out-of-order cores under speculation, none of this matters anyway.