Re: [PATCH v5 08/14] arm64: Import latest optimization of memcpy

From: Robin Murphy
Date: Tue Jun 01 2021 - 08:06:44 EST


On 2021-06-01 11:03, Sunil Kovvuri wrote:
On Mon, Sep 14, 2020 at 8:44 PM Oliver Swede <oli.swede@xxxxxxx> wrote:

From: Sam Tebbs <sam.tebbs@xxxxxxx>

Import the latest memcpy implementation into memcpy and
copy_{from,to,in}_user.
The user copy routines are provided in two forms, one for when UAO is
enabled and one for when it is disabled, with the choice between them
made by a runtime patch. This avoids executing the many NOPs that are
otherwise emitted when UAO is disabled.
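For illustration, the dispatch idea reads roughly like the C sketch below.
This is not the series' actual code (which does the selection at the
assembly level via the arm64 alternatives framework); the function and
helper names are invented, and cpus_have_const_cap()/ARM64_HAS_UAO are
used as they existed in kernels of that era:

	/*
	 * Hedged C-level sketch only.  The point is that the branch below
	 * is resolved by boot-time patching (a static key), so each call
	 * pays one patched branch to pick a whole UAO or non-UAO copy
	 * body, instead of executing NOP-padded per-access alternatives
	 * inside the copy loop.
	 */
	#include <linux/uaccess.h>
	#include <asm/cpufeature.h>

	/* Hypothetical copy bodies, named here for illustration only. */
	unsigned long copy_body_uao(void __user *to, const void *from,
				    unsigned long n);
	unsigned long copy_body_no_uao(void __user *to, const void *from,
				       unsigned long n);

	unsigned long arch_copy_to_user_sketch(void __user *to,
					       const void *from,
					       unsigned long n)
	{
		/* cpus_have_const_cap() compiles to a boot-patched branch */
		if (cpus_have_const_cap(ARM64_HAS_UAO))
			return copy_body_uao(to, from, n);

		return copy_body_no_uao(to, from, n);
	}
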

The project containing optimized implementations for various library
functions has now been renamed from 'cortex-strings' to
'optimized-routines', and the new upstream source is
string/aarch64/memcpy.S as of commit 4c175c8be12 in
https://github.com/ARM-software/optimized-routines.

Signed-off-by: Sam Tebbs <sam.tebbs@xxxxxxx>
[ rm: add UAO fixups, streamline copy_exit paths, expand commit message ]
Signed-off-by: Robin Murphy <robin.murphy@xxxxxxx>
[ os: import newer memcpy algorithm, update commit message ]
Signed-off-by: Oliver Swede <oli.swede@xxxxxxx>
---
 arch/arm64/include/asm/alternative.h |  36 ---
 arch/arm64/lib/copy_from_user.S      | 113 ++++++--
 arch/arm64/lib/copy_in_user.S        | 129 +++++++--
 arch/arm64/lib/copy_template.S       | 375 +++++++++++++++------------
 arch/arm64/lib/copy_template_user.S  |  24 ++
 arch/arm64/lib/copy_to_user.S        | 112 ++++++--
 arch/arm64/lib/copy_user_fixup.S     |  14 +
 arch/arm64/lib/memcpy.S              |  47 ++--
 8 files changed, 557 insertions(+), 293 deletions(-)
create mode 100644 arch/arm64/lib/copy_template_user.S
create mode 100644 arch/arm64/lib/copy_user_fixup.S

Do you have any performance data with this patch?
I see these patches are still not pushed to mainline; any reason?

Funny you should pick up on this 6-month-old thread just days after I've been posting new versions of the relevant parts[1] :)

I think this series mostly stalled on the complexity of the usercopy parts, which then turned into even more of a moving target anyway; that's why I decided to split it up.

Also curious to know why 128-bit registers were not considered, similar to
https://android.googlesource.com/platform/bionic.git/+/a71b4c3f144a516826e8ac5b262099b920c49ce0/libc/arch-arm64/generic-neon/bionic/memcpy.S

The overhead of kernel_neon_begin() etc. is significant, and usually only worth it in places like the crypto routines where there's enough benefit from actual ASIMD computation to outweigh the save/restore cost. On smaller cores where the L1 interface is only 128 bits wide anyway there is no possible gain in memcpy() throughput to ever offset that cost, and even for wider microarchitectures it's only likely to start breaking even at relatively large copy sizes. Plus we can't necessarily assume the ASIMD registers are even present (apparently the lack of a soft-float ABI hasn't stopped people from wanting to run Linux on such systems...)
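For reference, any in-kernel ASIMD use has to be wrapped roughly as in the
sketch below (illustrative only, and not something this series does;
copy_with_asimd() is an invented name):

	/*
	 * Hedged illustration of the fixed cost described above.
	 * kernel_neon_begin() saves the task's live FPSIMD state and
	 * disables preemption; kernel_neon_end() re-enables preemption,
	 * with the user state restored lazily later.  That overhead has
	 * to be amortised before 128-bit loads/stores can win.
	 */
	#include <linux/string.h>
	#include <asm/neon.h>
	#include <asm/simd.h>

	static void copy_with_asimd(void *dst, const void *src, size_t len)
	{
		if (!may_use_simd()) {		/* e.g. nested use in softirq context */
			memcpy(dst, src, len);	/* fall back to the GPR-based copy */
			return;
		}

		kernel_neon_begin();	/* save FPSIMD state, disable preemption */
		/* ... a q-register (128-bit) load/store loop would go here ... */
		kernel_neon_end();	/* re-enable preemption */
	}
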

Robin.

[1] https://lore.kernel.org/linux-arm-kernel/cover.1622128527.git.robin.murphy@xxxxxxx/