Re: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines
From: Oli Swede
Date: Wed Jul 01 2020 - 04:15:48 EST
Hi all,
(Apologies if you have already received this follow-up; please ignore
any previous duplicates. I am re-sending this with the lists in the CC,
as the formatting filters caught me out :) Thanks!)
> Version 3 addressed this but I later found some issues with the fixup
> correctness after further testing, and have partially re-written them
> here, and addressed some other behaviours of the copy algorithm.
Further to the above description from the cover letter I have expanded
on this below.
The approach introduced in this version redirects the faulting
instruction to one of multiple fixup subroutines, depending on
properties of that instance of the instruction in the copy template:
whether it specifies the target address relative to the start or end of
the buffer, and whether the algorithm has reached a stage where the
destination address has been set to be 16-byte aligned (this is more
relevant for longer copies).
These two properties both relate to the copy algorithm's aim of
reducing the total number of instructions by making accesses that may
overlap, and thus copy the same data multiple times. This is in contrast
to the current algorithm, which may need to branch to subroutines that
complete a number of remaining bytes that isn't divisible by the copy
unit initially used. E.g. with the new algorithm, a 17-byte copy will
complete with two load and two store instructions, by copying the first
16 bytes from the start of the buffer, then the last 16 bytes ending at
the end of the buffer.
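To illustrate the idea (this is only a model of the overlap scheme, not
the kernel assembly), a copy of 17..32 bytes can always be completed
with two 16-byte accesses, one anchored at the start of the buffer and
one at the end, with the middle bytes written twice:

```python
# Model of the overlapping-copy idea: the second 16-byte access is
# addressed relative to the end of the buffer ([srcend, -16] in the
# template), so for a 17-byte copy bytes 1..15 are copied twice.

def copy_small(dst, src, count):
    """Copy count (17..32) bytes using two overlapping 16-byte moves."""
    assert 17 <= count <= 32
    dst[0:16] = src[0:16]                       # first 16 bytes, at [src]
    dst[count - 16:count] = src[count - 16:count]  # last 16, at [srcend, -16]
    return dst

src = bytearray(range(17))
dst = bytearray(17)
copy_small(dst, src, 17)
```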
There appear to be two types of overlapping copies within a source or
destination buffer:
1. If src is initially not 16-byte aligned, then it is reset to
(src & ~15), and copying then resumes from aligned_src+16. In this
scenario any data between aligned_src+16 and src+16 may be copied again
by the overlapping instruction.
2. If the target address is specified relative to the end of the buffer
(e.g. [srcend, -16]) then an overlap is taking place, so the fixup
should check the fault address against the threshold up to which data
is known to have already been copied.
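The arithmetic for the second case can be sketched as follows (names
and the exact clamping are illustrative assumptions, not the kernel's
code): the naive remainder derived from the fault address must be
clamped against the threshold already known to have been copied.

```python
# Hypothetical model of the type-2 (end-relative) fixup arithmetic: the
# fault address may fall inside a region an earlier, overlapping access
# already copied, so the return value is clamped to that threshold.

def bytes_remaining_end_relative(dst, count, fault_addr, copied_up_to):
    """Bytes not yet copied when an end-relative access faults.

    copied_up_to: first destination address NOT yet known to be copied.
    """
    dstend = dst + count
    naive = dstend - fault_addr          # answer from fault address alone
    return min(naive, dstend - copied_up_to)  # earlier copies already succeeded
```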
In particular, encapsulating these properties through separate fixups
allows for:
a) Accurate rounding-down to the base of the block within which the
fault occurred.
The assumption here is that load and store instructions are
single-copy atomic, so a fault in one instruction will not result in any
of its other bytes succeeding. For instance, the copy unit size could be
1, 4, 8, or 16 bytes, depending on the instruction (ldrb/strb, ldr/str
of a 32-bit or 64-bit register, or ldp/stp), but any threshold prior to
which copying may have succeeded will be the target address, i.e. the
base of this block.
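A minimal sketch of this rounding-down step (function names are
illustrative, and it assumes the access is naturally aligned, as holds
once the algorithm reaches its 16-byte-aligned stage):

```python
# Knowing which instruction faulted tells the fixup the access size, so
# the fault address can be rounded down to the base of the failed block;
# nothing at or beyond that base was copied by the faulting instruction.

def block_base(fault_addr, access_size):
    """Round a fault address down to the base of its copy unit.

    Assumes the access is naturally aligned to access_size.
    """
    return fault_addr & ~(access_size - 1)

def bytes_remaining(dst, count, fault_addr, access_size):
    # Remainder runs from the faulting block's base to the end of dst.
    return (dst + count) - block_base(fault_addr, access_size)
```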
Because the fixup routing itself implies certain information about the
target address, the scenario shouldn't occur where an instruction faults
on an address but the backtracking to the target is inaccurate (as a
result of copy overlaps within the buffer).
b) All data copied up to some address to be taken into account, in the
case where a page is swapped out immediately before the same address is
accessed as a result of an overlap.
This is somewhat related to the above, but addresses a specific
situation, where a page could be swapped out mid-way through a copy and
result in the same address faulting on its second copy, even if the
algorithm has succeeded up to a threshold address beyond the fault.
Within each fixup, the calculation of the number of bytes remaining is
performed using knowledge of the stage of the algorithm that has been
reached, any previous steps before the fault, and the information
retrieved from the stack, which contains the original arguments to the
usercopy function. It is then able to determine the accurate return
value. For systems implementing UAO, because each ldp/stp instruction is
substituted with 2x ldtr/sttr instructions, a fault in the second half
of the doubleword pair needs to take any completion of the first half
into account, and there are in-line modifications for this (they are not
necessary for the load instruction fixups).
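The UAO adjustment amounts to the following (a hedged sketch of the
store-side arithmetic only; names are illustrative):

```python
# With UAO, a 16-byte stp in the copy template becomes two 8-byte sttr
# instructions. If the second sttr faults, the first 8 bytes of that
# 16-byte unit have already been stored, so the store fixup must not
# count them as remaining.

def remaining_with_uao(dst, count, unit_base, faulted_second_half):
    """Bytes remaining when an sttr within a 16-byte unit faults.

    unit_base: base address of the 16-byte unit being stored.
    faulted_second_half: True if the fault was in the second sttr.
    """
    remaining = (dst + count) - unit_base
    if faulted_second_half:
        remaining -= 8   # first sttr of the pair already completed
    return remaining
```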
The categorization of the fixups based on the algorithm's behaviour may
also provide a general approach to replacing the fixups for future copy
imports, as they are fairly modular. However, there is an open question
as to whether it would be easy to replace them given this framework, as
the fixup subroutines are relatively long for the store instructions. I
have added comments to explain the calculations, but I am wondering if
it would be possible to divert the fixup logic to make it easier to
modify, if the aim is to reduce the coupling between the fixup routines
and the copy algorithm.
I am waiting on access to the relevant machine before posting the
benchmark results for this optimized memcpy, but Sam reported the
following improvements with the similar (but now slightly older)
cortex-strings version:
* copy_from_user: 13.17%
* copy_to_user: 4.8%
* memcpy: 27.88%
* copy_in_user: Didn't appear in the test results.
This machine will also be used to check the fixups are accurate on a
system with UAO - they appear to be exact on a non-UAO system with PAN
that I've been working on locally.
I should also mention that the correctness of these routines was tested
using a selftest module akin to lib/test_user_copy.c (whose usercopy
functionality checks these patches do pass), but one that is more
specific to the fixup accuracy, in that it compares the return value
with the true number of bytes remaining in the destination buffer at the
point of a fault.
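The accuracy check at the heart of such a test can be modelled like
this (an illustrative sketch, not the actual test module; it assumes
the source data never contains the poison byte and that the copy only
writes a prefix of the destination):

```python
# Pre-fill the destination with a poison byte, perform a copy that is
# forced to stop short, then compare the reported "bytes remaining"
# against the count of untouched poison bytes at the tail of dst.

POISON = 0xAA

def true_bytes_remaining(dst_buf):
    """Count trailing poison bytes, i.e. bytes the copy never wrote."""
    remaining = 0
    for b in reversed(dst_buf):
        if b != POISON:
            break
        remaining += 1
    return remaining

def check_fixup(dst_buf, reported_remaining):
    return reported_remaining == true_bytes_remaining(dst_buf)
```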
Thanks in advance,
Oliver
On 6/30/20 8:48 PM, Oliver Swede wrote:
Hi,
This contains an update to the cortex-strings patchset: the
correctness of the fixup routines is improved, with the aim being to
return the exact number of remaining bytes for all copy sizes.
Ensuring they are exact - which the current fixups are not for some
copy sizes, being off by a few bytes - is an extension of the
original intention of fixing an issue reported by an LTP run last
year, where the fixup routine in v2 of this patchset (which was
importing the cortex-strings memcpy implementation) would over-report
the number of bytes that were successfully copied.
Version 3 addressed this but I later found some issues with the fixup
correctness after further testing, and have partially re-written them
here, and addressed some other behaviours of the copy algorithm.
Comments welcome,
Thanks
Oliver
v1: https://lore.kernel.org/linux-arm-kernel/cover.1571073960.git.robin.murphy@xxxxxxx/
v2: https://lore.kernel.org/linux-arm-kernel/cover.1571421836.git.robin.murphy@xxxxxxx/
v3: https://lore.kernel.org/linux-arm-kernel/20200514143227.605-1-oli.swede@xxxxxxx/
Changes since v3:
* Improves the accuracy of the fixups in response to issues that
arose during further testing
* Accounts for faults on store instructions on systems with UAO
enabled
* Expands on comments detailing the implementation
Changes since v2:
* Adds Robin's separate patch that fixes a compilation issue with
KProbes fixup [1]
* Imports the most recent memcpy implementation by updating Sam's
patch (and moves this patch to occur after the cortex-strings
importing so that it's closer to the patches containing its
corresponding fixups)
* Uses the stack to preserve the initial parameters
* Replaces the usercopy fixup routine in v2 with multiple longer
fixups that each make use of the fault address to return the exact
number of bytes that haven't yet been copied.
[1] https://lore.kernel.org/linux-arm-kernel/e70f7b9de7e601b9e4a6fedad8eaf64d304b1637.1571326276.git.robin.murphy@xxxxxxx/
Oliver Swede (5):
arm64: Store the arguments to copy_*_user on the stack
arm64: Use additional memcpy macros and fixups
arm64: Add fixup routines for usercopy load exceptions
arm64: Add fixup routines for usercopy store exceptions
arm64: Improve accuracy of fixup for UAO cases
Robin Murphy (2):
arm64: kprobes: Drop open-coded exception fixup
arm64: Tidy up _asm_extable_faultaddr usage
Sam Tebbs (7):
arm64: Allow passing fault address to fixup handlers
arm64: Import latest version of Cortex Strings' memcmp
arm64: Import latest version of Cortex Strings' memmove
arm64: Import latest version of Cortex Strings' strcmp
arm64: Import latest version of Cortex Strings' strlen
arm64: Import latest version of Cortex Strings' strncmp
arm64: Import latest optimization of memcpy
arch/arm64/include/asm/alternative.h | 36 ---
arch/arm64/include/asm/assembler.h | 13 +
arch/arm64/include/asm/extable.h | 10 +-
arch/arm64/kernel/probes/kprobes.c | 7 -
arch/arm64/lib/copy_from_user.S | 272 +++++++++++++++--
arch/arm64/lib/copy_in_user.S | 287 ++++++++++++++++--
arch/arm64/lib/copy_template.S | 377 +++++++++++++----------
arch/arm64/lib/copy_template_user.S | 50 ++++
arch/arm64/lib/copy_to_user.S | 273 +++++++++++++++--
arch/arm64/lib/copy_user_fixup.S | 433 +++++++++++++++++++++++++++
arch/arm64/lib/memcmp.S | 333 ++++++++------------
arch/arm64/lib/memcpy.S | 127 ++++++--
arch/arm64/lib/memmove.S | 232 +++++---------
arch/arm64/lib/strcmp.S | 272 +++++++----------
arch/arm64/lib/strlen.S | 247 ++++++++++-----
arch/arm64/lib/strncmp.S | 363 ++++++++++------------
arch/arm64/mm/extable.c | 13 +-
arch/arm64/mm/fault.c | 2 +-
18 files changed, 2228 insertions(+), 1119 deletions(-)
create mode 100644 arch/arm64/lib/copy_template_user.S
create mode 100644 arch/arm64/lib/copy_user_fixup.S