Re: [RFC PATCH] x86: prevent gcc from emitting rep movsq/stosq for inlined ops

From: Mateusz Guzik
Date: Sun Apr 13 2025 - 15:01:35 EST


On Sun, Apr 13, 2025 at 8:20 PM David Laight
<david.laight.linux@xxxxxxxxx> wrote:
>
> On Sun, 13 Apr 2025 12:27:08 +0200
> Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
>
> > On Wed, Apr 2, 2025 at 6:27 PM Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > >
> > > On Wed, Apr 2, 2025 at 6:22 PM Linus Torvalds
> > > <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > > >
> > > > On Wed, 2 Apr 2025 at 06:42, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> > > > >
> > > > >
> > > > > +ifdef CONFIG_CC_IS_GCC
> > > > > +#
> > > > > +# Inline memcpy and memset handling policy for gcc.
> > > > > +#
> > > > > +# For ops of sizes known at compilation time it quickly resorts to issuing rep
> > > > > +# movsq and stosq. On most uarchs rep-prefixed ops have a significant startup
> > > > > +# latency and it is faster to issue regular stores (even if in loops) to handle
> > > > > +# small buffers.
> > > > > +#
> > > > > +# This of course comes at a cost in i-cache footprint. bloat-o-meter
> > > > > +# reported a 0.23% increase for enabling these.
> > > > > +#
> > > > > +# We inline up to 256 bytes, which in the best case issues a few movs and
> > > > > +# in the worst case creates a 4 * 8 store loop.
> > > > > +#
> > > > > +# The upper limit was chosen semi-arbitrarily -- uarchs wildly differ in the
> > > > > +# threshold past which a rep-prefixed op becomes faster, 256 being the lowest
> > > > > +# common denominator. Someone(tm) should revisit this from time to time.
> > > > > +#
> > > > > +KBUILD_CFLAGS += -mmemcpy-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > > > +KBUILD_CFLAGS += -mmemset-strategy=unrolled_loop:256:noalign,libcall:-1:noalign
> > > > > +endif
> > > >
> > > > Please make this a gcc bug-report instead - I really don't want to
> > > > have random compiler-specific tuning options in the kernel.
> > > >
> > > > Because that whole memcpy-strategy thing is something that gets tuned
> > > > by a lot of other compiler options (ie -march and different versions).
> > > >
> > >
> > > Ok.
> >
> > So I reported this upstream:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119596
> >
> > And found some other problems in the meantime:
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119703
> > https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119704
> >
> > Looks like this particular bit has been around for quite some time now.
> >
> > I also confirmed there is a benefit on AMD CPUs.
>
> Is that a benefit of doing 'rep movsb' or a benefit of not doing it?
>

It is a benefit from issuing regular stores instead of rep movsq, at
least for the sizes I tested.

I make no claim this is the fastest thing out there for any uarch,
merely that regular stores beat what gcc is emitting now.
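
As a toy illustration (not from the patch -- just something to eyeball
the codegen; the struct and its size are made up), compiling the below
with -O2 -S, with and without the -mmemcpy-strategy/-mmemset-strategy
flags quoted above, should show gcc flipping between rep movsq/stosq
and plain stores:

#include <string.h>

/* 192 bytes, an arbitrary "small" size below the 256 cutoff */
struct blob { unsigned long w[24]; };

void copy_blob(struct blob *dst, const struct blob *src)
{
	/* fixed-size copy: gcc may expand this as rep movsq or as a
	 * short run of plain 8-byte loads/stores, depending on the
	 * selected strategy */
	memcpy(dst, src, sizeof(*dst));
}

void zero_blob(struct blob *dst)
{
	/* same deal for memset vs rep stosq */
	memset(dst, 0, sizeof(*dst));
}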

> It also depends very much on the actual cpu.
> I think zen5 is faster (at running 'rep movsb') than earlier ones.
> But someone needs to run the same test on a range of cpus.
>
> I've found a 'cunning plan' to actually measure instruction clock times.
> While 'mfence' will wait for all the instructions to complete, it is
> horribly expensive.
> The trick is to use data dependencies and the 'pmc' cycle counter.
> So something like:
> volatile int always_zero;
> ...
> int zero = always_zero;
> start = rdpmc(reg_no);
> updated = do_rep_movsb(dst, src, count + (start & zero));
> end = rdpmc(reg_no + (updated & zero));
> elapsed = end - start;
> So the cpu has to execute the rdpmc() either side of the code
> being tested.
> For 'rep_movsb' it might be reasonable to use the updated address (or count),
> but you could read back the last memory location to get a true execution time.
>
> I've not tried to time memcpy() loops that way, but for arithmetic you
> can use the data dependency to measure the clock count for divide.
>
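
For reference, here is my reading of that trick as a rough, untested C
sketch -- the counter number is a placeholder, and it assumes user-space
rdpmc is enabled (CR4.PCE set and a counter programmed, e.g. via perf)
and count > 0:

#include <stdint.h>
#include <string.h>

static volatile int always_zero;	/* always 0, but the compiler can't prove it */

static inline uint64_t rdpmc(unsigned int reg_no)
{
	uint32_t lo, hi;

	asm volatile("rdpmc" : "=a" (lo), "=d" (hi) : "c" (reg_no));
	return ((uint64_t)hi << 32) | lo;
}

uint64_t time_copy(char *dst, const char *src, size_t count,
		   unsigned int reg_no)
{
	int zero = always_zero;	/* 0 at run time, unknown at compile time */
	uint64_t start, end;
	int tail;

	start = rdpmc(reg_no);
	/* 'count' now depends on 'start', so the copy cannot be started
	 * ahead of the first read */
	memcpy(dst, src, count + (start & zero));
	/* read back the last byte so the second read depends on the copy
	 * having actually completed */
	tail = ((volatile char *)dst)[count - 1];
	end = rdpmc(reg_no + (tail & zero));

	return end - start;
}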

That said, I'm rather wary of microbenchmarks of this sort, as they
detach the actual op from its natural environment (if you will).

The good news is that the page fault microbenchmark I added has a
specific memcpy as one of the bottlenecks (in sync_regs()).
Improvements one way or the other for that size can be measured
without any of the disturbances mentioned above.

--
Mateusz Guzik <mjguzik gmail.com>