RE: [PATCH 2/3] arm64: lib: improve copy performance when size is ge 128 bytes

From: David Laight
Date: Wed Mar 24 2021 - 12:39:30 EST


From: Robin Murphy
> Sent: 23 March 2021 12:09
>
> On 2021-03-23 07:34, Yang Yingliang wrote:
> > When copy over 128 bytes, src/dst is added after
> > each ldp/stp instruction, it will cost more time.
> > To improve this, we only add src/dst after load
> > or store 64 bytes.
>
> This breaks the required behaviour for copy_*_user(), since the fault
> handler expects the base address to be up-to-date at all times. Say
> you're copying 128 bytes and fault on the 4th store, it should return 80
> bytes not copied; the code below would return 128 bytes not copied, even
> though 48 bytes have actually been written to the destination.
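
To make the arithmetic above concrete, here is a rough model in C. It assumes each store writes 16 bytes (one stp of two 64-bit registers, which matches the 48/80 figures) and that the fault handler derives its count purely from the current dst base. The function name and the update_every parameter are made up for illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define STORE_BYTES 16	/* assumed: one stp of two 64-bit registers */

/*
 * Bytes reported as "not copied" when store number faulting_store
 * (1-based) faults, if the dst base register is only advanced once
 * per update_every bytes. Hypothetical model, not kernel code.
 */
static size_t bytes_not_copied(size_t total, size_t faulting_store,
			       size_t update_every)
{
	size_t written, seen;

	assert(faulting_store >= 1 && update_every > 0);

	/* Every store before the faulting one completed. */
	written = (faulting_store - 1) * STORE_BYTES;
	/* The handler only sees progress up to the last base update. */
	seen = written - written % update_every;

	return total - seen;
}
```

With update_every set to 16 (base advanced on every stp) a fault on the 4th store of a 128 byte copy gives 80; with update_every set to 64 the same fault gives 128.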

Are there any non-superscalar arm64 CPUs (that anyone cares about)?

If the cpu can execute multiple instructions in one clock
then it is usually possible to get the loop control (almost) free.

You might need to unroll once to interleave read/write
but any more may be pointless.
So something like:
	a = *src++;
	do {
		b = *src++;
		*dst++ = a;
		a = *src++;
		*dst++ = b;
	} while (src != lim);
	*dst++ = a;	/* a holds the last word loaded */
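
For reference, a compilable sketch of that loop in C. It assumes 64-bit words; the helper name and the tail handling for even counts are additions for illustration only. The loads run one word ahead of the stores, so the word left over after the loop is the one most recently loaded into a.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n 64-bit words from src to dst with the loads running one
 * word ahead of the stores. Hypothetical helper, not kernel code.
 */
static void copy_words_pipelined(uint64_t *dst, const uint64_t *src, size_t n)
{
	const uint64_t *lim = src + n;
	uint64_t a, b;

	if (n == 0)
		return;
	assert(dst && src);

	a = *src++;			/* prime the pipeline */
	while (lim - src >= 2) {	/* at least two words still unread */
		b = *src++;
		*dst++ = a;
		a = *src++;
		*dst++ = b;
	}
	if (src != lim) {		/* even n: one word still unread */
		b = *src++;
		*dst++ = a;
		a = b;
	}
	*dst = a;			/* drain the last word loaded */
}
```

The unrolled-once body is the same as above; only the entry check and the even-count tail are extra.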

David
