Re: x86 copy performance regression
From: Linus Torvalds
Date: Fri May 26 2023 - 12:30:25 EST
On Fri, May 26, 2023 at 8:00 AM Eric Dumazet <edumazet@xxxxxxxxxx> wrote:
>
> We can see rep_movs_alternative() using more cycles in kernel profiles
> than the previous variant (copy_user_enhanced_fast_string, which was
> simply using "rep movsb"), and we cannot reach line rate (as we
> could before the series).
Hmm. I assume the attached patch ends up fixing the regression?
That hack to generate the two-byte 'jae' instruction even for the
alternative is admittedly not pretty, but I just couldn't stand the
pointlessly bad code that the plain alternative otherwise generated.
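(To decode the magic bytes: 0x73 is the opcode of the short two-byte
'jae rel8' form, and the rel8 displacement is counted from the end of
the instruction, hence the "-0b-2" with the '0:' label placed right in
front of it. So

    0:	.byte 0x73, .Lunrolled-0b-2	# same two bytes as a short 'jae .Lunrolled'

just emitted by hand, so that inside the alternative the assembler
can't turn it into the six-byte 'jae rel32' form with a relocation.)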
We could also make the constant in the comparison depend on whether it
is for the unrolled or for the erms case, I guess, but I think erms is
probably "good enough" with 64-byte copies.
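IOW, after alternatives patching the head of rep_movs_alternative()
ends up looking roughly like

    cmpq $64,%rcx
    jae .Lunrolled		# !ERMS: big copies take the unrolled loop

on !ERMS machines, and like

    cmpq $64,%rcx
    jae .Llarge			# ERMS: big copies just do 'rep movsb'

on ERMS ones, with the same 64-byte cutoff in both cases.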
I was really hoping we could avoid this, but hey, a regression is a regression.
Can you verify this patch fixes things for you?
Linus
arch/x86/lib/copy_user_64.S | 16 +++++++++++++++-
1 file changed, 15 insertions(+), 1 deletion(-)
diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 4fc5c2de2de4..21f11bd36cdc 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -7,9 +7,17 @@
*/
#include <linux/linkage.h>
+#include <asm/alternative.h>
#include <asm/asm.h>
#include <asm/export.h>
+/*
+ * Disgusting hack to generate a two-byte 'jae' instruction
+ * for 'alternatives' that would otherwise generate a relocation
+ * and a big jump.
+ */
+#define JAE(x) ".byte 0x73," #x "-0b-2"
+
/*
* rep_movs_alternative - memory copy with exception handling.
* This version is for CPUs that don't have FSRM (Fast Short Rep Movs)
@@ -29,7 +37,7 @@
*/
SYM_FUNC_START(rep_movs_alternative)
cmpq $64,%rcx
- jae .Lunrolled
+0: alternative JAE(.Lunrolled), JAE(.Llarge), X86_FEATURE_ERMS
cmp $8,%ecx
jae .Lword
@@ -65,6 +73,12 @@ SYM_FUNC_START(rep_movs_alternative)
_ASM_EXTABLE_UA( 2b, .Lcopy_user_tail)
_ASM_EXTABLE_UA( 3b, .Lcopy_user_tail)
+.Llarge:
+0: rep movsb
+1: RET
+
+ _ASM_EXTABLE_UA( 0b, 1b)
+
.p2align 4
.Lunrolled:
10: movq (%rsi),%r8