[RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
From: Mateusz Guzik
Date: Wed Apr 02 2025 - 09:45:17 EST
Not a real submission yet as I would like results from other people.
tl;dr: when benchmarking compilation of a hello-world program I'm getting
a 1.7% increase in throughput on Sapphire Rapids after convincing the
compiler to only use regular stores for inlined memset and memcpy.
Note this uarch does have FSRM and it still benefits from avoiding
rep-prefixed ops in some cases.
I am not in a position to bench this on other CPUs; it would be nice if
someone did it on AMD.
Onto the business:
The kernel is chock full of inlined rep movsq and rep stosq, including
in hot paths, and rep-prefixed ops are known to be detrimental to
performance below certain sizes.
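To illustrate the pattern, here is a sketch written for this mail rather than
taken from the tree; the struct is made up and merely matches the 0xa8-byte
copy seen below:
/* copy_sketch.c -- hypothetical example. Built with the kernel's usual
 * no-SIMD flags, gcc lowers this fixed-size copy to
 * "mov $0x15,%ecx; rep movsq" by default; with the strategy overrides
 * from this patch it becomes an unrolled store loop instead. */
struct fake_regs {
	unsigned long r[21];	/* 21 * 8 = 0xa8 bytes, same size as pt_regs */
};

void copy_regs(struct fake_regs *dst, const struct fake_regs *src)
{
	__builtin_memcpy(dst, src, sizeof(*dst));	/* size known at compile time */
}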
The most notable example is sync_regs:
<+0>: endbr64
<+4>: mov %gs:0x22ca5d4(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: mov %rdi,%rsi
<+15>: sub $0xa8,%rax
<+21>: cmp %rdi,%rax
<+24>: je 0xffffffff82244a55 <sync_regs+37>
<+26>: mov $0x15,%ecx
<+31>: mov %rax,%rdi
<+34>: rep movsq %ds:(%rsi),%es:(%rdi)
<+37>: jmp 0xffffffff82256ba0 <__x86_return_thunk>
When issuing hello-world compiles in a loop, this routine accounts for over
1% of total CPU time as reported by perf. With the kernel recompiled to
instead do the copy with regular stores, it drops to 0.13%.
Recompiled, the routine looks like this:
<+0>: endbr64
<+4>: mov %gs:0x22b9f44(%rip),%rax # 0xffffffff8450f010 <cpu_current_top_of_stack>
<+12>: sub $0xa8,%rax
<+18>: cmp %rdi,%rax
<+21>: je 0xffffffff82255114 <sync_regs+84>
<+23>: xor %ecx,%ecx
<+25>: mov %ecx,%edx
<+27>: add $0x20,%ecx
<+30>: mov (%rdi,%rdx,1),%r10
<+34>: mov 0x8(%rdi,%rdx,1),%r9
<+39>: mov 0x10(%rdi,%rdx,1),%r8
<+44>: mov 0x18(%rdi,%rdx,1),%rsi
<+49>: mov %r10,(%rax,%rdx,1)
<+53>: mov %r9,0x8(%rax,%rdx,1)
<+58>: mov %r8,0x10(%rax,%rdx,1)
<+63>: mov %rsi,0x18(%rax,%rdx,1)
<+68>: cmp $0xa0,%ecx
<+74>: jb 0xffffffff822550d9 <sync_regs+25>
<+76>: mov (%rdi,%rcx,1),%rdx
<+80>: mov %rdx,(%rax,%rcx,1)
<+84>: jmp 0xffffffff822673e0 <__x86_return_thunk>
bloat-o-meter says:
Total: Before=30021301, After=30089151, chg +0.23%
Other spots are of course modified as well, and they too see a reduction
in time spent.
Bench results, expressed as compilations completed in a 10-second period
with /tmp backed by tmpfs:
before:
978 ops (97 ops/s)
979 ops (97 ops/s)
978 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
979 ops (97 ops/s)
after:
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
997 ops (99 ops/s)
996 ops (99 ops/s)
I'm running this with a Debian 12 userspace (gcc 12.2.0).
I asked the LKP folks to bench this but have not received a response yet:
https://lore.kernel.org/oe-lkp/CAGudoHHd8TkyA1kOQ2KtZdZJ2VxUW=2mP-JR0t_oR07TfrwN8w@xxxxxxxxxxxxxx/
Repro instructions:
for i in $(seq 1 10); do taskset --cpu-list 1 ./ccbench 10; done
taskset is important, as otherwise processes roam around the box quite a
bit and skew the results.
Attached files are:
- cc.c -- will-it-scale testcase, for anyone who wants to profile the thing
  while it loops indefinitely
- src0.c -- the hello world used for reference; plop it into /src/src0.c
- ccbench.c -- the bench itself; compile with cc -O2 -o ccbench ccbench.c
ccbench spawns gcc through system(), forcing it to go through the shell, which
mimics what happens when compiling with make.
arch/x86/Makefile | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/arch/x86/Makefile b/arch/x86/Makefile
index 9b76e77ff7f7..1a1afcc3041f 100644
--- a/arch/x86/Makefile
+++ b/arch/x86/Makefile
@@ -198,6 +198,29 @@ ifeq ($(CONFIG_STACKPROTECTOR),y)
endif
endif
+ifdef CONFIG_CC_IS_GCC
+#
+# Inline memcpy and memset handling policy for gcc.
+#
+# For ops with sizes known at compilation time gcc quickly resorts to issuing
+# rep movsq and stosq. On most uarchs rep-prefixed ops have a significant
+# startup latency and it is faster to issue regular stores (even if in loops)
+# to handle small buffers.
+#
+# This of course comes at an expense in i-cache footprint. bloat-o-meter
+# reported a 0.23% size increase from enabling these options.
+#
+# We inline ops up to 256 bytes, which in the best case issues a few movs and
+# in the worst case creates a loop of 4 stores, 8 bytes each.
+#
+# The upper limit was chosen semi-arbitrarily -- uarchs differ wildly in the
+# threshold past which a rep-prefixed op becomes faster, with 256 being the
+# lowest common denominator. Someone(tm) should revisit this from time to time.
+#
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+
#
# If the function graph tracer is used with mcount instead of fentry,
# '-maccumulate-outgoing-args' is needed to prevent a GCC bug
--
2.43.0
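To eyeball what the strategy strings do without rebuilding the kernel, a
standalone check along these lines works; the file name and struct are made up
for the example and -mno-sse is only a rough stand-in for the kernel's no-SIMD
flags. Each comma-separated triplet in the option is alg:max_size:dest_align,
with a max_size of -1 meaning no upper bound:
/* strategytest.c -- hypothetical standalone check, not part of the patch.
 * Compare the assembly from:
 *   gcc -O2 -mno-sse -S strategytest.c
 *   gcc -O2 -mno-sse -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign \
 *       -S strategytest.c
 * Assuming the default strategy picks a rep-prefixed op here (as it does in
 * the kernel build), the first should show rep movsq for the 168-byte copy
 * and the second an unrolled mov loop. */
#include <string.h>

struct blob {
	unsigned long w[21];	/* 168 bytes, same size as the sync_regs copy */
};

void blob_copy(struct blob *dst, const struct blob *src)
{
	memcpy(dst, src, sizeof(*dst));
}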
/* ccbench.c -- the bench: keep compiling src0.c via system() and report how
 * many compilations finished within the requested time limit. */
#include <sys/types.h>
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define RUNCMD "gcc -c -o /tmp/out0.o /src/src0.c"
#define WARMUP 5	/* seconds of warmup before the timed run */

volatile sig_atomic_t got_alarm;

static void sigalrm_handler(int signo)
{
	(void)signo;
	got_alarm = 1;
}

int main(int argc, char **argv)
{
	long i;
	int n;

	if (argc != 2) {
		errx(1, "need time limit in seconds");
	}
	n = atoi(argv[1]);
	if (n < 1) {
		errx(1, "bad arg");
	}

	signal(SIGALRM, sigalrm_handler);

	/* Warmup pass, skipped if CCBENCH_SKIP_WARMUP is set. */
	if (!getenv("CCBENCH_SKIP_WARMUP")) {
		alarm(WARMUP);
		for (i = 0; !got_alarm; i++) {
			system(RUNCMD);
		}
		printf("warmup: %ld ops (%ld ops/s)\n", i, i / WARMUP);
		got_alarm = 0;
	}

	/* Timed run: compile in a loop until the alarm fires. */
	alarm(n);
	for (i = 0; !got_alarm; i++) {
		system(RUNCMD);
	}
	printf("bench: %ld ops (%ld ops/s)\n", i, i / n);

	return 0;
}
/* src0.c -- the hello world being compiled; place it at /src/src0.c */
#include <stdio.h>

int main(void)
{
	printf("Hello world!\n");
	return 0;
}
/* cc.c -- will-it-scale testcase: compile /src/src<nr>.c in an endless loop
 * so the kernel can be profiled while it runs. */
#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

char *testcase_description = "compile";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	char cmd[1024];

	/* nr identifies the worker instance (set by the will-it-scale harness). */
	snprintf(cmd, sizeof(cmd), "cc -c -o /tmp/out.%lu /src/src%lu.c", nr, nr);

	while (1) {
		system(cmd);
		(*iterations)++;
	}
}