Re: [PATCH] x86: only use ERMS for user copies for larger sizes

From: Jens Axboe
Date: Wed Nov 21 2018 - 08:32:33 EST


On 11/20/18 11:36 PM, Ingo Molnar wrote:
>
> [ Cc:-ed a few other gents and lkml. ]
>
> * Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> Hi,
>>
>> So this is a fun one... While I was doing the aio polled work, I noticed
>> that the submitting process spent a substantial amount of time copying
>> data to/from userspace. For aio, that's iocb and io_event, which are 64
>> and 32 bytes respectively. Looking closer at this, it seems that ERMS
>> rep movsb is SLOWER for smaller copies, due to a higher startup cost.
>>
>> I came up with this hack to test it out, and lo and behold, we now cut
>> the time spent copying in half.
>>
>> Since these kinds of patches tend to lend themselves to bike shedding, I
>> also ran a string of kernel compilations out of RAM. Results are as
>> follows:
>>
>> Patched : 62.86s avg, stddev 0.65s
>> Stock : 63.73s avg, stddev 0.67s
>>
>> which would also seem to indicate that we're faster punting smaller
>> (< 128 byte) copies.
>>
>> CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
>>
>> Interestingly, text size is smaller with the patch as well?!
>>
>> I'm sure there are smarter ways to do this, but results look fairly
>> conclusive. FWIW, the behavioral change was introduced by:
>>
>> commit 954e482bde20b0e208fd4d34ef26e10afd194600
>> Author: Fenghua Yu <fenghua.yu@xxxxxxxxx>
>> Date: Thu May 24 18:19:45 2012 -0700
>>
>> x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
>>
>> which contains nothing in terms of benchmarking or results, just claims
>> that the new hotness is better.
>>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>> ---
>>
>> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
>> index a9d637bc301d..7dbb78827e64 100644
>> --- a/arch/x86/include/asm/uaccess_64.h
>> +++ b/arch/x86/include/asm/uaccess_64.h
>> @@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
>> {
>> unsigned ret;
>>
>> + /*
>> + * For smaller copies, don't use ERMS as it's slower.
>> + */
>> + if (len < 128) {
>> + alternative_call(copy_user_generic_unrolled,
>> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> + ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>> + "=d" (len)),
>> + "1" (to), "2" (from), "3" (len)
>> + : "memory", "rcx", "r8", "r9", "r10", "r11");
>> + return ret;
>> + }
>> +
>> /*
>> * If CPU has ERMS feature, use copy_user_enhanced_fast_string.
>> * Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
>> * Otherwise, use copy_user_generic_unrolled.
>> */
>> alternative_call_2(copy_user_generic_unrolled,
>> - copy_user_generic_string,
>> - X86_FEATURE_REP_GOOD,
>> - copy_user_enhanced_fast_string,
>> - X86_FEATURE_ERMS,
>> + copy_user_generic_string, X86_FEATURE_REP_GOOD,
>> + copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
>> ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
>> "=d" (len)),
>> "1" (to), "2" (from), "3" (len)
>
> So I'm inclined to do something like yours, because clearly the changelog
> of 954e482bde20 was at least partly false: Intel can say whatever they
> want, it's a fact that ERMS has high setup costs for low buffer sizes -
> ERMS is optimized for large size, cache-aligned copies mainly.

I'm actually surprised that something like that was accepted; I guess
2012 was a simpler time :-)

> But the result is counter-intuitive in terms of kernel text footprint,
> plus the '128' is pretty arbitrary - we should at least try to come up
> with a break-even point where manual copy is about as fast as ERMS - on
> at least a single CPU ...

I did some more investigation yesterday, and found this:

commit 236222d39347e0e486010f10c1493e83dbbdfba8
Author: Paolo Abeni <pabeni@xxxxxxxxxx>
Date: Thu Jun 29 15:55:58 2017 +0200

x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings

which does attempt to rectify this, but it only avoids ERMS for copies
shorter than 64 bytes. At least for me, the break-even point looks higher
than that, which would mean that something like the below is more
appropriate. Adding Paolo, in case he actually wrote a test app for this.
In terms of test apps, I'm always wary of relying solely on
microbenchmarks for this type of thing; real-world testing (if stable
enough) is much more useful.


diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index db4e5aa0858b..21c4d68c5fac 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -175,8 +175,8 @@ EXPORT_SYMBOL(copy_user_generic_string)
*/
ENTRY(copy_user_enhanced_fast_string)
ASM_STAC
- cmpl $64,%edx
- jb .L_copy_short_string /* less then 64 bytes, avoid the costly 'rep' */
+ cmpl $128,%edx
+ jb .L_copy_short_string /* less than 128 bytes, avoid costly 'rep' */
movl %edx,%ecx
1: rep
movsb

--
Jens Axboe