Re: [PATCH v4 00/14] arm64: Optimise and update memcpy, user copy and string routines

From: Oli Swede
Date: Fri Sep 11 2020 - 11:19:04 EST


Hi Will and Catalin,

Thank you both for having a look at this, the coupling between the usercopy routines and the fixup routines was my main concern as well and I understand the desire to take an approach that avoids this dependency when it comes to importing new copy routines.

On 9/11/20 12:29 PM, Catalin Marinas wrote:
>
> I was a bit more optimistic about this series until I saw the
> copy_user_fixup.S changes (patches 12 to 14). I have a suspicion it's
> only Oli (and maybe Robin) who understands it, so from a maintainer's
> perspective it doesn't really scale. It's also very fragile with any
> minor tweak in the actual copy routine potentially breaking the fixup
> code.
>

I was wondering if it would be possible to check that an alternative expression of the logic involved (e.g. diverted to a C function - if technically feasible - that is easier to understand and modify) would not really be desirable either?

On 9/11/20 12:29 PM, Catalin Marinas wrote:
> On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
>>
>> I'm inclined to say that cortex-strings is probably not a good basis for
>> our uaccess routines. The code needs to be adapted in a non-straightforward
>> way so that we lose pretty much all of the benefits we'd usually get from
>> adopted an existing implementation; we can't pull in fixes or improvements
>> without a lot of manual effort, we can't reuse existing testing infrastructure
>> (see below) and we end up being a "second-class" user of the routines
>> because of the discrepancies in implementation.
>
> I was a bit more optimistic about this series until I saw the
> copy_user_fixup.S changes (patches 12 to 14). I have a suspicion it's
> only Oli (and maybe Robin) who understands it, so from a maintainer's
> perspective it doesn't really scale. It's also very fragile with any
> minor tweak in the actual copy routine potentially breaking the fixup
> code.
>

An alternative approach to the fixup routines came up recently in discussions; it could potentially enable the copy template to be preserved in its current form and allow the uaccess routines to re-use a shared copy algorithm (including cortex-strings and/or future optimized out-of-order algorithms), and is based on one of Robin's suggestions from a few weeks ago (I found some time to prototype this locally the other day). The algorithm-specific backtracking would be dropped in favour of first- and second-stage fixups, with each of these being relatively short & straightforward, and (together) likely compatible with future copy routines.
More specifically, within the fixup path invoked after the initial fault, it could jump back N bytes and then begin an in-order byte-wise copy from within the first fixup, and then intentionally fault for the second time at either the same or another fault address. As the intermediate scan-copy should continue right up to the 'true' fault address, and ignore any out-of-order techniques used by the optimized algorithm, the fixup for the second fault could then contain an accurate, conclusive calculation of the return value.

A technique for this which would need to be carried forward from patch 11 in v4 would be to redirect the PC to a different 'intermediate' fixup (though only two choices this time) depending on whether the faulting instruction is a load or store, so it can establish if the minimum address to jump back to should be src or buf when comparing to (faddr-N). Additionally, for this the parameters to the usercopy functions would still need to be stored on the stack (patch 10 in v4).
Otherwise, it would involve a slight rewrite of v4 that could provide a more maintainable approach; the same fixups would be shared for all copy sizes & fault addresses (versus in v4) and the only variable subject to re-evaluation with new imports might be N (currently for copy sizes over 64B it is guaranteed anything below the previous 64B will have been copied). It could also remove the need to special-case UAO systems when compared to v4.

If it's possible for pages to be swapped out in the middle of a usercopy function then this method might not provide the extra implicit knowledge used by the fixups in v4 about the stage of the algorithm that has been reached (deduced from the fixup that has been routed to, implied by the fault-instruction-to-fixup mapping) to compensate for this, so it could under-report the bytes copied if it faults at an earlier address during the in-order (secondary) copy. This could be outweighed by the potential advantages mentioned above, and I mainly include this as v4 should account for this scenario (but perhaps unnecessarily if the scenario isn't possible or too unlikely :) ), but as you rightly point out it would be tiresome to re-write these routines.

On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
>
> So why don't we use cortex-strings as a basis for the in-kernel routines
> only, preferably in a form where the code can be used directly and updated
> with a script (e.g. similar to how we pull in arch/arm64/crypto routines
> from OpenSSL). We can then roll our own uaccess routines, using a slightly
> more straight-forward implementation which is more amenable to handling
> user faults and doesn't do things like over copying.

I was wondering if the proposal described above is something that you would want to look into, or if the use of separate copy routines (script-based or otherwise) for in-kernel/uaccess is the preferred approach?

On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
>
> Can we put this test module into the kernel source tree, please, maybe as
> part of lkdtm? Given the control flow of these optimised functions, I think
> we absolutely need targeted testing to make sure we're getting complete
> coverage.

I will post the standalone module in its current state later today/early next week, the tests are exhaustive but if it would be useful to include a mechanism for specifying certain ranges of faults/copy sizes or improve the error reporting, I'd be happy to add these features (and/or integrate it into lkdtm).

Many thanks,
Oli


On 9/11/20 12:29 PM, Catalin Marinas wrote:
On Mon, Sep 07, 2020 at 11:10:03AM +0100, Will Deacon wrote:
On Wed, Jul 01, 2020 at 09:12:49AM +0100, Oli Swede wrote:
Version 3 addressed this but I later found some issues with the fixup
correctness after further testing, and have partially re-written them
here, and addressed some other behaviours of the copy algorithm.

[...]

I am waiting on access to the relevant machine before posting the benchmark
results for this optimized memcpy, but Sam reported the following with the
similar (but now slightly older) cortex-strings version:
* copy_from_user: 13.17%
* copy_to_user: 4.8%
* memcpy: 27.88%
* copy_in_user: Didn't appear in the test results.
This machine will also be used to check the fixups are accurate on a system
with UAO - they appear to be exact on a non-UAO system with PAN that I've
been working on locally.

I'm inclined to say that cortex-strings is probably not a good basis for
our uaccess routines. The code needs to be adapted in a non-straightforward
way so that we lose pretty much all of the benefits we'd usually get from
adopted an existing implementation; we can't pull in fixes or improvements
without a lot of manual effort, we can't reuse existing testing infrastructure
(see below) and we end up being a "second-class" user of the routines
because of the discrepancies in implementation.

I was a bit more optimistic about this series until I saw the
copy_user_fixup.S changes (patches 12 to 14). I have a suspicion it's
only Oli (and maybe Robin) who understands it, so from a maintainer's
perspective it doesn't really scale. It's also very fragile with any
minor tweak in the actual copy routine potentially breaking the fixup
code.

So why don't we use cortex-strings as a basis for the in-kernel routines
only, preferably in a form where the code can be used directly and updated
with a script (e.g. similar to how we pull in arch/arm64/crypto routines
from OpenSSL). We can then roll our own uaccess routines, using a slightly
more straight-forward implementation which is more amenable to handling
user faults and doesn't do things like over copying.

I think that's probably the best option. I wouldn't mind replacing the
in-kernel memcpy/strcpy etc. with these patches since the work was done
already but definitely not for the uaccess and fixup routines (we still
have the original implementation in the git log).

A script would work even better. Do we have any issue with licensing
though? Cortex Strings is BSD (3-clause IIRC) and copyright owned by
Linaro. I got them to officially agree with relicensing (dual license)
the latest copy under GPLv2 so that we can contribute it to the kernel.
But since the project license is still BSD, any future updates in there
are BSD-only.

Maybe someone who understands this stuff can confirm that it's ok to
regularly grab the Cortex Strings files into a GPLv2 codebase without
asking for Linaro's permission.

BTW, you could pick the kprobes patch in here, that explicit fixup call
is not necessary.