Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for inlined ops
From: David Laight
Date: Sun Apr 13 2025 - 14:20:29 EST
On Sun, 13 Apr 2025 12:27:08 +0200
Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> On Wed, Apr 2, 2025 at 6:27 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >
> > On Wed, Apr 2, 2025 at 6:22 PM Linus Torvalds
> > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > >
> > > >
> > > > +ifdef CONFIG_CC_IS_GCC
> > > > +#
> > > > +# Inline memcpy and memset handling policy for gcc.
> > > > +#
> > > > +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> > > > +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> > > > +# latency and it is faster to issue regular stores (even if in loops) to handle
> > > > +# small buffers.
> > > > +#
> > > > +# This of course comes at an expense in terms of i-cache footprint. bloat-o-meter
> > > > +# reported 0.23% increase for enabling these.
> > > > +#
> > > > +# We inline up to 256 bytes, which in the best case issues a few movs and in the
> > > > +# worst case creates a 4 * 8 store loop.
> > > > +#
> > > > +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ between a
> > > > +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> > > > +# common denominator. Someone(tm) should revisit this from time to time.
> > > > +#
> > > > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > > +endif
> > >
> > > Please make this a gcc bug-report instead - I really don't want to
> > > have random compiler-specific tuning options in the kernel.
> > >
> > > Because that whole memcpy-strategy thing is something that gets tuned
> > > by a lot of other compiler options (ie -march and different versions).
> > >
> >
> > Ok.
>
> So I reported this upstream:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
>
> And found some other problems in the meantime:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
>
> Looks like this particular bit has been persisting for quite some time now.
>
> I also confirmed there is a benefit on AMD CPUs.
Is that a benefit of doing 'rep movsb' or a benefit of not doing it?
It also depends very much on the actual cpu.
I think zen5 is faster (at running 'rep movsb') than earlier ones.
But someone needs to run the same test on a range of cpus.
I've found a 'cunning plan' to actually measure instruction clock times.
While 'mfence' will wait for all the instructions to complete, it is
horribly expensive.
The trick is to use data dependencies and the 'pmc' cycle counter.
So something like:
volatile int always_zero;
...
int zero = always_zero;
start = rdpmc(reg_no);
updated = do_rep_movsb(dst, src, count + (start & zero));
end = rdpmc(reg_no + (updated & zero));
elapsed = end - start;
So the cpu has to execute the rdpmc() either side of the code
being tested.
For 'rep_movsb' it might be reasonable to use the updated address (or count),
but you could read back the last memory location to get a true execution time.
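Fleshed out a bit (untested sketch; it assumes rdpmc is usable from userspace,
i.e. CR4.PCE is set, and that 'reg_no' selects a counter already programmed to
count core clocks - the helper names are just made up for illustration):

#include <stddef.h>
#include <stdint.h>

/* Read performance counter 'reg_no' (EDX:EAX), deliberately not serialising. */
static inline uint64_t rdpmc(uint32_t reg_no)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (reg_no));
	return (uint64_t)hi << 32 | lo;
}

/* Plain 'rep movsb', returns the updated destination pointer. */
static inline void *do_rep_movsb(void *dst, const void *src, size_t count)
{
	asm volatile("rep movsb"
		     : "+D" (dst), "+S" (src), "+c" (count)
		     : : "memory");
	return dst;
}

volatile size_t always_zero;		/* the compiler cannot prove this is 0 */

uint64_t time_rep_movsb(void *dst, const void *src, size_t count, uint32_t reg_no)
{
	size_t zero = always_zero;
	uint64_t start, end;
	void *updated;

	start = rdpmc(reg_no);
	/* '(start & zero)' is always 0 but makes the copy depend on the first read */
	updated = do_rep_movsb(dst, src, count + (start & zero));
	/* the second read cannot issue until the copy has produced 'updated';
	   reading back the last byte of dst instead would also include the
	   store latency */
	end = rdpmc(reg_no + ((uintptr_t)updated & zero));
	return end - start;
}

Run it a few times back-to-back and look at the minimum; the first pass will
also pick up the cache and TLB misses.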
I've not tried to time memcpy() loops that way, but for arithmetic you
can measure the data-dependent clock count for divide.
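Something like this shows the divide case (again untested, reusing rdpmc()
and always_zero from the sketch above; the constants are arbitrary):

/* Rough latency of a 64-bit divide: every iteration depends on the last result. */
uint64_t time_div_latency(uint32_t reg_no, unsigned int iters)
{
	size_t zero = always_zero;
	uint64_t x = 0x0123456789abcdefULL;
	uint64_t start, end;
	unsigned int i;

	start = rdpmc(reg_no);
	x |= start & zero;			/* chain the divides onto 'start' */
	for (i = 0; i < iters; i++)
		x = (x / 7) | (1ULL << 60);	/* keep the dividend large, stay dependent */
	end = rdpmc(reg_no + (x & zero));
	return (end - start) / iters;		/* roughly clocks per divide */
}

Change the dividend/divisor and the per-op count changes, which is the data
dependence showing up.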
David