Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest

From: Doug Anderson
Date: Tue Sep 22 2020 - 20:39:41 EST


On Mon, Sep 21, 2020 at 11:25 PM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
>
> On Tue, 22 Sep 2020 at 02:27, Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > On every boot we see messages like this:
> >
> > [ 0.025360] calling calibrate_xor_blocks+0x0/0x134 @ 1
> > [ 0.025363] xor: measuring software checksum speed
> > [ 0.035351] 8regs : 3952.000 MB/sec
> > [ 0.045384] 32regs : 4860.000 MB/sec
> > [ 0.055418] arm64_neon: 5900.000 MB/sec
> > [ 0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > [ 0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> >
> > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > again, the arm64_neon implementation is the fastest way to do XOR.
> > ...and the above is on a system with HZ=1000. Due to the way the
> > testing happens, if we have HZ defined to something slower it'll take
> > much longer. HZ=100 means we spend 300 ms on every boot re-confirming
> > a fact that will be the same for every bootup.
> >
> > Trying to super-optimize the xor operation makes a lot of sense if
> > you're using software RAID, but the above is probably not worth it for
> > most Linux users because:
> > 1. Quite a few arm64 kernels are built for embedded systems where
> > software RAID isn't common. That means we're spending lots of time
> > on every boot trying to optimize something we don't use.
> > 2. Presumably, if we have neon, it's faster than alternatives. If
> > it's not, it's not expected to be tons slower.
> > 3. Quite a lot of arm64 systems are big.LITTLE. This means that the
> > existing test is somewhat misguided because it's assuming that test
> > results on the boot CPU apply to the other CPUs in the system.
> > This is not necessarily the case.
> >
> > Let's add a new config option that allows us to just use the neon
> > functions (if present) without benchmarking.
> >
> > NOTE: One small side effect is that on an arm64 system _without_ neon
> > we'll end up testing the xor_block_8regs_p and xor_block_32regs_p
> > versions of the function. That's presumably OK since we already test
> > all those when KERNEL_MODE_NEON is disabled.
> >
> > ALSO NOTE: presumably the way to do better than this is to add some
> > sort of per-CPU-core lookup table and jump to a per-CPU-core-specific
> > XOR function each time xor is called. Without seeing evidence that
> > this would really help someone, though, that doesn't seem worth it.
> >
> > Signed-off-by: Douglas Anderson <dianders@xxxxxxxxxxxx>
>
> On the two arm64 machines that I happen to have running right now, I get
>
> SynQuacer (Cortex-A53)
>
> 8regs : 1917.000 MB/sec
> 32regs : 2270.000 MB/sec
> arm64_neon: 2053.000 MB/sec
>
> ThunderX2
>
> 8regs : 10170.000 MB/sec
> 32regs : 12051.000 MB/sec
> arm64_neon: 10948.000 MB/sec
>
> so your assertion is not entirely valid.

OK, good to know.
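
For anyone skimming, here is a small sketch that compares the three sets of numbers quoted in this thread. The throughput values are copied from the logs above. The names (`RESULTS`, `fastest`, `neon_deficit_percent`) and the label for my own machine are made up for illustration; nothing here is kernel code.

```python
# Throughput in MB/sec, copied from the boot logs quoted in this thread.
# The machine labels are only for display; "original patch log" is a
# hypothetical label for the unnamed system in the patch description.
RESULTS = {
    "original patch log": {"8regs": 3952.0, "32regs": 4860.0, "arm64_neon": 5900.0},
    "SynQuacer (Cortex-A53)": {"8regs": 1917.0, "32regs": 2270.0, "arm64_neon": 2053.0},
    "ThunderX2": {"8regs": 10170.0, "32regs": 12051.0, "arm64_neon": 10948.0},
}


def fastest(speeds):
    """Return the name of the implementation with the highest throughput."""
    return max(speeds, key=speeds.get)


def neon_deficit_percent(speeds):
    """Return how far arm64_neon trails the best result, in percent."""
    best = max(speeds.values())
    return 100.0 * (best - speeds["arm64_neon"]) / best


if __name__ == "__main__":
    for machine, speeds in RESULTS.items():
        print(f"{machine}: fastest={fastest(speeds)}, "
              f"NEON trails best by {neon_deficit_percent(speeds):.1f}%")
```

On both of the machines you measured, 32regs comes out ahead and NEON is within roughly 10% of it, so my point 2 holds only in its weaker form ("not expected to be tons slower").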


> If the system does not need XOR, it is free not to load the module, so
> there is no reason it has to affect the boot time.

The fact that it was run super early somehow made me just assume that
this couldn't be a module, but of course you're right that it can be a
module. That works for me and saves me my precious boot time. ;-)

That being said, this'll still bite anyone who wants to build this in
for whatever reason. I'll respond to your other email with more...
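
To put a rough number on the built-in case: in the log quoted above, each of the three implementations takes about 10 ms at HZ=1000, i.e. roughly 10 jiffies apiece. Assuming that per-implementation cost stays fixed in jiffies (an assumption inferred from the timestamps, not a statement about the calibration code itself), a back-of-the-envelope sketch of the cost at other HZ values could look like this. `calibration_ms` and the constants are hypothetical names for illustration.

```python
# Implementations benchmarked in the quoted boot log.
IMPLEMENTATIONS = ("8regs", "32regs", "arm64_neon")

# Assumption: about 10 jiffies per implementation, inferred from the ~10 ms
# gaps between log timestamps on a HZ=1000 system.
JIFFIES_PER_IMPL = 10


def calibration_ms(hz, n_impls=len(IMPLEMENTATIONS), jiffies_per_impl=JIFFIES_PER_IMPL):
    """Estimate total calibration time in milliseconds for a given HZ."""
    ms_per_jiffy = 1000.0 / hz
    return n_impls * jiffies_per_impl * ms_per_jiffy


if __name__ == "__main__":
    for hz in (1000, 250, 100):
        print(f"HZ={hz}: ~{calibration_ms(hz):.0f} ms")
```

This reproduces the 30 ms (HZ=1000) and 300 ms (HZ=100) figures from the patch description.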