Re: [PATCH] x86: handle the tail in rep_movs_alternative() with an overlapping store

From: Mateusz Guzik
Date: Thu Mar 20 2025 - 15:33:57 EST


On Thu, Mar 20, 2025 at 8:23 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Thu, 20 Mar 2025 at 12:06, Mateusz Guzik <mjguzik@xxxxxxxxx> wrote:
> >
> > Sizes ranged <8,64> are copied 8 bytes at a time with a jump out to a
> > 1 byte at a time loop to handle the tail.
>
> I definitely do not mind this patch, but I think it doesn't go far enough.
>
> It gets rid of the byte-at-a-time loop at the end, but only for the
> short-copy case of 8-63 bytes.
>

This bit I can vouch for.

> The .Llarge_movsq ends up still doing
>
>         testl %ecx,%ecx
>         jne .Lcopy_user_tail
>         RET
>
> and while that is only triggered by the non-ERMS case, that's what
> most older AMD CPU's will trigger, afaik.
>

This bit I can't.

Per my other e-mail, it has been several years since I was seriously
digging in this area (around 7 by now, I think) and the details are
rather fuzzy.

I recall that handling the tail after rep movsq with an overlapping
store suffered a penalty big enough to warrant a "normal" copy instead,
one which avoids the just-written area. I see my old routine $elsewhere
makes sure to do the latter. I also don't have sensible hw to bench this
on at the moment.
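
For illustration only, here is a minimal C sketch of the two tail
strategies being compared. It is not the kernel's assembly and not my
old routine: copy_words() is a hypothetical stand-in for the rep movsq
part, and both function names are made up. Variant A finishes with one
8-byte store that may overlap bytes already written, variant B only
touches the bytes that are still missing. It says nothing about which
one is faster on any given CPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for "rep movsq": copy whole 8-byte words only.
 * Returns the number of bytes copied (len rounded down to 8). */
static size_t copy_words(unsigned char *dst, const unsigned char *src,
			 size_t len)
{
	uint64_t word;
	size_t off;

	for (off = 0; off + 8 <= len; off += 8) {
		memcpy(&word, src + off, 8);
		memcpy(dst + off, &word, 8);
	}
	return off;
}

/* Variant A: finish with a single 8-byte store ending at dst + len.
 * When len is not a multiple of 8 this store overlaps bytes that the
 * word loop has just written. Requires len >= 8. */
static void copy_tail_overlapping(unsigned char *dst,
				  const unsigned char *src, size_t len)
{
	uint64_t word;
	size_t done;

	assert(len >= 8);
	done = copy_words(dst, src, len);
	if (done != len) {
		memcpy(&word, src + len - 8, 8);
		memcpy(dst + len - 8, &word, 8);
	}
}

/* Variant B: finish with 4/2/1-byte stores that only touch the
 * remaining 0..7 bytes, never the area written by the word loop. */
static void copy_tail_disjoint(unsigned char *dst,
			       const unsigned char *src, size_t len)
{
	size_t off = copy_words(dst, src, len);
	size_t rem = len - off;

	if (rem & 4) {
		memcpy(dst + off, src + off, 4);
		off += 4;
	}
	if (rem & 2) {
		memcpy(dst + off, src + off, 2);
		off += 2;
	}
	if (rem & 1)
		dst[off] = src[off];
}
```

Both variants produce the same bytes in the destination. The only
difference is whether the final store lands on top of freshly written
data, which is the part that would need benchmarking.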

That said, if you insist on it, I'll repost v2 with the change (I'm
going to *test* it of course, just not bench. :>)
--
Mateusz Guzik <mjguzik gmail.com>