Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for inlined ops

From: Mateusz Guzik
Date: Sun Apr 13 2025 - 06:27:33 EST


On Wed, Apr 2, 2025 at 6:27 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> On Wed, Apr 2, 2025 at 6:22 PM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > >
> > >
> > > +ifdef CONFIG_CC_IS_GCC
> > > +#
> > > +# Inline memcpy and memset handling policy for gcc.
> > > +#
> > > +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> > > +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> > > +# latency and it is faster to issue regular stores (even if in loops) to handle
> > > +# small buffers.
> > > +#
> > > +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> > > +# reported 0.23% increase for enabling these.
> > > +#
> > > +# We inline up to 256 bytes, which in the best case issues few movs, in the
> > > +# worst case creates a 4 * 8 store loop.
> > > +#
> > > +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> > > +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> > > +# common denominator. Someone(tm) should revisit this from time to time.
> > > +#
> > > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > +endif
> >
> > Please make this a gcc bug-report instead - I really don't want to
> > have random compiler-specific tuning options in the kernel.
> >
> > Because that whole memcpy-strategy thing is something that gets tuned
> > by a lot of other compiler options (ie -march and different versions).
> >
>
> Ok.

So I reported this upstream:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596

And found some other problems in the meantime:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704

It looks like this particular issue has been present for quite some time now.

I also confirmed there is a benefit on AMD CPUs.

I added a new bench: page faults over an area of 2MB - 4096 bytes (as in
just shy of a huge page). With this I'm seeing a 17% increase in throughput.

The profile before shows sync_regs at 11.81%; after, it drops to below 1%(!)

That is to say, the 'rep movsq' and 'rep stosq' codegen is really harmful
and I think it would be a shame to let it linger given how easy it is to
avoid.

Even if the gcc folk address this soon(tm), users won't be able to
benefit from it for quite some time.

I think it is a fair point that the patch as posted completely ignored
-mtune, but this particular aspect is trivially remedied by gating it
like so:
+ifdef CONFIG_CC_IS_GCC
+ifndef CONFIG_X86_NATIVE_CPU
+KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
+endif
+endif

Given the above, would you be ok with allowing the patch into the tree
as a temporary measure? It can be gated on compiler version later on
(if gcc folk fix the problem) or straight up removed.
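
Something along these lines would do for the version gate (the gcc 16
cutoff below is purely a placeholder for whatever release ends up with
a fix, if any):

ifdef CONFIG_CC_IS_GCC
# placeholder cutoff -- adjust to the first gcc release with the fix
ifeq ($(shell test $(CONFIG_GCC_VERSION) -lt 160000 && echo y),y)
KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
endif
endif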

Here is perf top from that page fault test, before:
22.07% [kernel] [k] asm_exc_page_fault
12.83% pf_processes [.] testcase
11.81% [kernel] [k] sync_regs
7.55% [kernel] [k] _raw_spin_lock
2.32% [kernel] [k] __handle_mm_fault
2.27% [kernel] [k] mas_walk
2.15% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
2.02% [kernel] [k] clear_page_erms
1.98% [kernel] [k] __mod_memcg_lruvec_state
1.62% [kernel] [k] lru_add
1.60% [kernel] [k] do_anonymous_page
1.39% [kernel] [k] folios_put_refs
1.39% [kernel] [k] unmap_page_range
1.28% [kernel] [k] do_user_addr_fault
1.22% [kernel] [k] __lruvec_stat_mod_folio
1.21% [kernel] [k] get_page_from_freelist
1.20% [kernel] [k] lock_vma_under_rcu

and after:
26.06% [kernel] [k] asm_exc_page_fault
13.18% pf_processes [.] testcase
8.53% [kernel] [k] _raw_spin_lock
2.19% [kernel] [k] __mod_memcg_lruvec_state
2.17% [kernel] [k] __raw_callee_save___pv_queued_spin_unlock
2.15% [kernel] [k] clear_page_erms
2.13% [kernel] [k] __handle_mm_fault
1.94% [kernel] [k] do_anonymous_page
1.93% [kernel] [k] mas_walk
1.68% [kernel] [k] lru_add
1.65% [kernel] [k] __lruvec_stat_mod_folio
1.53% [kernel] [k] folios_put_refs
1.53% [kernel] [k] get_page_from_freelist
1.45% [kernel] [k] unmap_page_range
1.33% [kernel] [k] do_user_addr_fault
1.19% [kernel] [k] lock_vma_under_rcu
1.18% [kernel] [k] error_entry
[..]
0.91% [kernel] [k] sync_regs

The bench, pluggable into will-it-scale:
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <assert.h>

#define MEMSIZE ((2 * 1024 * 1024) - 4096)

char *testcase_description = "Anonymous memory page fault";

void testcase(unsigned long long *iterations, unsigned long nr)
{
	unsigned long pgsize = getpagesize();

	while (1) {
		unsigned long i;

		char *c = mmap(NULL, MEMSIZE, PROT_READ|PROT_WRITE,
			       MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);

		for (i = 0; i < MEMSIZE; i += pgsize) {
			c[i] = 0;
			(*iterations)++;
		}

		munmap(c, MEMSIZE);
	}
}

--
Mateusz Guzik <mjguzik gmail.com>