Re: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

From: Robin Murphy
Date: Wed Mar 24 2021 - 15:38:06 EST

In reply to: David Laight: "RE: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes"

On 2021-03-24 16:38, David Laight wrote:

From: Robin Murphy

Sent: 23 March 2021 12:09

On 2021-03-23 07:34, Yang Yingliang wrote:

When copying more than 128 bytes, src/dst is updated after
each ldp/stp instruction, which costs extra time.
To improve this, only update src/dst after loading
or storing 64 bytes.

This breaks the required behaviour for copy_*_user(), since the fault
handler expects the base address to be up-to-date at all times. Say
you're copying 128 bytes and fault on the 4th store, it should return 80
bytes not copied; the code below would return 128 bytes not copied, even
though 48 bytes have actually been written to the destination.
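
The arithmetic in that example can be checked with a small model. This is only a sketch, not kernel code: it assumes each stp writes 16 bytes and that the fault handler derives progress purely from the base register. The function name and its parameters are hypothetical.

```c
#include <stddef.h>

/*
 * Hypothetical model, not kernel code: how many bytes a faulting copy
 * reports as "not copied" when progress is derived from the base register.
 *
 * total        - bytes requested
 * store_size   - bytes written per store (assumed 16 for one stp)
 * fault_store  - 1-based index of the store that faults
 * update_every - bytes stored between base-register updates
 */
static size_t reported_not_copied(size_t total, size_t store_size,
                                  size_t fault_store, size_t update_every)
{
    /* Bytes that really reached the destination before the fault. */
    size_t written = (fault_store - 1) * store_size;
    /* Bytes the handler can see, rounded down to the last base update. */
    size_t visible = written - (written % update_every);

    return total - visible;
}
```

With the base updated after every store, reported_not_copied(128, 16, 4, 16) gives 80. With the base updated only every 64 bytes, reported_not_copied(128, 16, 4, 64) gives 128, although 48 bytes were written.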

Are there any non-superscalar arm64 CPUs (that anyone cares about)?

If the cpu can execute multiple instructions in one clock
then it is usually possible to get the loop control (almost) free.

You might need to unroll once to interleave read/write
but any more may be pointless.

Nah, the whole point is that using post-increment addressing is crap in the first place, because it introduces register dependencies between each access that could be avoided entirely if we could use offset addressing. It's especially crap when we don't even *have* a post-index addressing mode for the unprivileged load/store instructions used in copy_*_user(), and have to simulate it with extra instructions that throw off the code alignment.
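
To show the shape of the two addressing styles, here is a rough C sketch. It is an illustration under assumptions, not the arm64 assembly: a C compiler is free to pick whatever addressing mode it likes, and the function names and the 8-word (64-byte) block size are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Post-increment style: every access bumps its pointer, so the address
 * of each access depends on the update made by the previous one.
 */
static void copy_postinc(uint64_t *dst, const uint64_t *src, size_t blocks)
{
    while (blocks--) {
        for (int i = 0; i < 8; i++)
            *dst++ = *src++;
    }
}

/*
 * Offset style: all accesses in a block use fixed offsets from the same
 * base, and the base is updated once per 64-byte block.
 */
static void copy_offset(uint64_t *dst, const uint64_t *src, size_t blocks)
{
    while (blocks--) {
        uint64_t a = src[0], b = src[1], c = src[2], d = src[3];
        uint64_t e = src[4], f = src[5], g = src[6], h = src[7];

        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
        dst[4] = e; dst[5] = f; dst[6] = g; dst[7] = h;

        src += 8;   /* single base update per block */
        dst += 8;
    }
}
```

Both functions copy the same data. The difference is that the offset version only advances the base once per block, so a fault in the middle of a block would see a stale base. That is the tension with the copy_*_user() requirement.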

We already have code that's tuned to work well across our microarchitectures[1]; the issue is that butchering it to satisfy the additional requirements of copy_*_user() with a common template has hobbled regular memcpy() performance. I intend to have a crack at fixing that properly tomorrow ;)

Robin.

[1] https://github.com/ARM-software/optimized-routines

So something like:
	a = *src++;
	do {
		b = *src++;
		*dst++ = a;
		a = *src++;
		*dst++ = b;
	} while (src != lim);
	*dst++ = a;	/* a is the last value loaded; b was already stored */
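
A compilable version of that loop might look like the following. This is a sketch under assumptions: the function name, the uint64_t element type and the word-count parameter n are not from the thread. The loop as written loads one word up front and two per iteration, so it only terminates cleanly when n is odd and at least 3. The tail stores a, since b was already written inside the loop.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy n 64-bit words with loads and stores
 * interleaved, one iteration handling two words.
 * Precondition for this sketch: n is odd and n >= 3.
 */
static void copy_interleaved(uint64_t *dst, const uint64_t *src, size_t n)
{
    const uint64_t *lim = src + n;
    uint64_t a, b;

    a = *src++;             /* prime the pipeline with the first word */
    do {
        b = *src++;         /* load the next word ... */
        *dst++ = a;         /* ... while storing the previous one */
        a = *src++;
        *dst++ = b;
    } while (src != lim);
    *dst++ = a;             /* drain the last word loaded */
}
```

For example, with a 5-word source array, copy_interleaved(dst, src, 5) performs 5 loads and 5 stores and leaves dst equal to src.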

David

