Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest

From: Doug Anderson
Date: Tue Sep 22 2020 - 20:39:47 EST


Hi,

On Tue, Sep 22, 2020 at 3:30 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> On Tue, 22 Sep 2020 at 10:26, David Laight <David.Laight@xxxxxxxxxx> wrote:
> >
> > From: Douglas Anderson
> > > Sent: 22 September 2020 01:26
> > >
> > > On every boot time we see messages like this:
> > >
> > > [ 0.025360] calling calibrate_xor_blocks+0x0/0x134 @ 1
> > > [ 0.025363] xor: measuring software checksum speed
> > > [ 0.035351] 8regs : 3952.000 MB/sec
> > > [ 0.045384] 32regs : 4860.000 MB/sec
> > > [ 0.055418] arm64_neon: 5900.000 MB/sec
> > > [ 0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > > [ 0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> > >
> > > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > > again, the arm64_neon implementation is the fastest way to do XOR.
> > > ...and the above is on a system with HZ=1000. Due to the way the
> > > testing happens, if we have HZ defined to something slower it'll take
> > > much longer. HZ=100 means we spend 300 ms on every boot re-confirming
> > > a fact that will be the same for every bootup.
> >
> > Can't the code use a TSC (or similar high-res counter) to
> > see how long it takes to process a short 'hot cache' block?
> > That wouldn't take long at all.
> >
>
> This is generic code that runs from an core_initcall() so I am not
> sure we can easily implement this in a portable way.

If it ran later, presumably you could just use ktime? That seems like
it'd be a portable enough way?


> Doug: would it help if we deferred this until late_initcall()? We
> could take an arbitrary pick from the list at core_initcall() time to
> serve early users, and update to the fastest one at a later time.

Yeah, I think that'd work OK. One advantage of it being later would
be that it could run in parallel to other things that were happening
in the system (anyone who enabled async probe on their driver). Even
better would be if your code itself could run async and not block the
rest of boot. ;-) I do like the idea that we could just arbitrarily
pick one implementation until we've calibrated. I guess we'd want to
figure out how to do this lockless but it shouldn't be too hard to
just check to see if a single pointer is non-NULL and once it becomes
non-NULL then you can use it... ...or a pointer plus a sentinel if
writing the pointer can't be done atomically...

It also feels like with the large number of big.LITTLE systems out
there you'd either want a lookup table per core or you'd want to do
calibration per core.

-Doug