Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest

From: Ard Biesheuvel
Date: Wed Sep 23 2020 - 06:15:13 EST

Next message: Tang Bin: "[PATCH] net: mdio: Remove redundant parameter and check"
Previous message: Artem Lapkin: "[PATCH 1/2] arm64: dts: rockchip: rk3399-khadas-edge: add missed spiflash node"
In reply to: Doug Anderson: "Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 23 Sep 2020 at 02:39, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> Hi,
>
> On Tue, Sep 22, 2020 at 3:30 AM Ard Biesheuvel <ardb@xxxxxxxxxx> wrote:
> >
> > On Tue, 22 Sep 2020 at 10:26, David Laight <David.Laight@xxxxxxxxxx> wrote:
> > >
> > > From: Douglas Anderson
> > > > Sent: 22 September 2020 01:26
> > > >
> > > > On every boot time we see messages like this:
> > > >
> > > > [ 0.025360] calling calibrate_xor_blocks+0x0/0x134 @ 1
> > > > [ 0.025363] xor: measuring software checksum speed
> > > > [ 0.035351] 8regs : 3952.000 MB/sec
> > > > [ 0.045384] 32regs : 4860.000 MB/sec
> > > > [ 0.055418] arm64_neon: 5900.000 MB/sec
> > > > [ 0.055423] xor: using function: arm64_neon (5900.000 MB/sec)
> > > > [ 0.055433] initcall calibrate_xor_blocks+0x0/0x134 returned 0 after 29296 usecs
> > > >
> > > > As you can see, we spend 30 ms on every boot re-confirming that, yet
> > > > again, the arm64_neon implementation is the fastest way to do XOR.
> > > > ...and the above is on a system with HZ=1000. Due to the way the
> > > > testing happens, if we have HZ defined to something slower it'll take
> > > > much longer. HZ=100 means we spend 300 ms on every boot re-confirming
> > > > a fact that will be the same for every bootup.
> > >
> > > Can't the code use a TSC (or similar high-res counter) to
> > > see how long it takes to process a short 'hot cache' block?
> > > That wouldn't take long at all.
> > >
> >
> > This is generic code that runs from an core_initcall() so I am not
> > sure we can easily implement this in a portable way.
>
> If it ran later, presumably you could just use ktime? That seems like
> it'd be a portable enough way?
>

That should work, I suppose. That should also permit us to simply time
N iterations of the benchmark instead of running it as many times as
we can while waiting for a jiffy to elapse.

>
> > Doug: would it help if we deferred this until late_initcall()? We
> > could take an arbitrary pick from the list at core_initcall() time to
> > serve early users, and update to the fastest one at a later time.
>
> Yeah, I think that'd work OK. One advantage of it being later would
> be that it could run in parallel to other things that were happening
> in the system (anyone who enabled async probe on their driver). Even
> better would be if your code itself could run async and not block the
> rest of boot. ;-)

My code? :-)

> I do like the idea that we could just arbitrarily
> pick one implementation until we've calibrated. I guess we'd want to
> figure out how to do this lockless but it shouldn't be too hard to
> just check to see if a single pointer is non-NULL and once it becomes
> non-NULL then you can use it... ...or a pointer plus a sentinel if
> writing the pointer can't be done atomically...
>

Surely, any SMP capable architecture that cares about atomicity at
that level can update a function pointer, which is guaranteed to be
the native word size, without tearing?

This should do it afaict:

--- a/crypto/xor.c
+++ b/crypto/xor.c
@@ -21,7 +21,7 @@
#endif

/* The xor routines to use. */
-static struct xor_block_template *active_template;
+static struct xor_block_template *active_template = xor_block_8regs;

void
xor_blocks(unsigned int src_count, unsigned int bytes, void *dest, void **srcs)
@@ -150,6 +150,5 @@ static __exit void xor_exit(void) { }

MODULE_LICENSE("GPL");

-/* when built-in xor.o must initialize before drivers/md/md.o */
-core_initcall(calibrate_xor_blocks);
+late_initcall(calibrate_xor_blocks);
module_exit(xor_exit);

> It also feels like with the large number of big.LITTLE systems out
> there you'd either want a lookup table per core or you'd want to do
> calibration per core.
>

I don't think the complexity is worth it, tbh, as there are too many
parameters to consider, although it would be nice if we could run the
benchmark on the best performing CPU (as that is where the scheduler
will run the code if it is on a sufficiently hot path, and if it is
not, it doesn't really matter)

Next message: Tang Bin: "[PATCH] net: mdio: Remove redundant parameter and check"
Previous message: Artem Lapkin: "[PATCH 1/2] arm64: dts: rockchip: rk3399-khadas-edge: add missed spiflash node"
In reply to: Doug Anderson: "Re: [PATCH] arm64: crypto: Add an option to assume NEON XOR is the fastest"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]