On 11/20/18 11:36 PM, Ingo Molnar wrote:
* Jens Axboe <axboe@xxxxxxxxx> wrote:
So this is a fun one... While I was doing the aio polled work, I noticed
that the submitting process spent a substantial amount of time copying
data to/from userspace. For aio, that's iocb and io_event, which are 64
and 32 bytes respectively. Looking closer at this, and it seems that
ERMS rep movsb is SLOWER for smaller copies, due to a higher startup
cost.
I came up with this hack to test it out, and low and behold, we now cut
the time spent in copying in half. 50% less.
Since these kinds of patches tend to lend themselves to bike shedding, I
also ran a string of kernel compilations out of RAM. Results are as
follows:
Patched : 62.86s avg, stddev 0.65s
Stock : 63.73s avg, stddev 0.67s
which would also seem to indicate that we're faster punting smaller
(< 128 byte) copies.
CPU: Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz
Interestingly, text size is smaller with the patch as well?!
I'm sure there are smarter ways to do this, but results look fairly
conclusive. FWIW, the behaviorial change was introduced by:
commit 954e482bde20b0e208fd4d34ef26e10afd194600
Author: Fenghua Yu <fenghua.yu@xxxxxxxxx>
Date: Thu May 24 18:19:45 2012 -0700
x86/copy_user_generic: Optimize copy_user_generic with CPU erms feature
which contains nothing in terms of benchmarking or results, just claims
that the new hotness is better.
Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
---
diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index a9d637bc301d..7dbb78827e64 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -29,16 +29,27 @@ copy_user_generic(void *to, const void *from, unsigned len)
{
unsigned ret;
+ /*
+ * For smaller copies, don't use ERMS as it's slower.
+ */
+ if (len < 128) {
+ alternative_call(copy_user_generic_unrolled,
+ copy_user_generic_string, X86_FEATURE_REP_GOOD,
+ ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
+ "=d" (len)),
+ "1" (to), "2" (from), "3" (len)
+ : "memory", "rcx", "r8", "r9", "r10", "r11");
+ return ret;
+ }
+
/*
* If CPU has ERMS feature, use copy_user_enhanced_fast_string.
* Otherwise, if CPU has rep_good feature, use copy_user_generic_string.
* Otherwise, use copy_user_generic_unrolled.
*/
alternative_call_2(copy_user_generic_unrolled,
- copy_user_generic_string,
- X86_FEATURE_REP_GOOD,
- copy_user_enhanced_fast_string,
- X86_FEATURE_ERMS,
+ copy_user_generic_string, X86_FEATURE_REP_GOOD,
+ copy_user_enhanced_fast_string, X86_FEATURE_ERMS,
ASM_OUTPUT2("=a" (ret), "=D" (to), "=S" (from),
"=d" (len)),
"1" (to), "2" (from), "3" (len)
So I'm inclined to do something like yours, because clearly the changelog
of 954e482bde20 was at least partly false: Intel can say whatever they
want, it's a fact that ERMS has high setup costs for low buffer sizes -
ERMS is optimized for large size, cache-aligned copies mainly.
I'm actually surprised that something like that was accepted, I guess
2012 was a simpler time :-)
But the result is counter-intuitive in terms of kernel text footprint,
plus the '128' is pretty arbitrary - we should at least try to come up
with a break-even point where manual copy is about as fast as ERMS - on
at least a single CPU ...
I did some more investigation yesterday, and found this:
commit 236222d39347e0e486010f10c1493e83dbbdfba8
Author: Paolo Abeni <pabeni@xxxxxxxxxx>
Date: Thu Jun 29 15:55:58 2017 +0200
x86/uaccess: Optimize copy_user_enhanced_fast_string() for short strings
which does attempt to rectify it, but only using ERMS for >= 64 byte copies.
At least for me, looks like the break even point is higher than that, which
would mean that something like the below would be more appropriate.