Re: [PATCH RESEND] x86/asm/32: Modernize _memcpy()

From: Uros Bizjak
Date: Tue Dec 16 2025 - 11:37:09 EST


On Tue, Dec 16, 2025 at 2:14 PM David Laight
<david.laight.linux@xxxxxxxxx> wrote:

> > 00e778b0 <memcpy>:
> >   e778b0:  55                    push   %ebp
> >   e778b1:  89 e5                 mov    %esp,%ebp
> >   e778b3:  83 ec 08              sub    $0x8,%esp
> >   e778b6:  89 75 f8              mov    %esi,-0x8(%ebp)
> >   e778b9:  89 d6                 mov    %edx,%esi
> >   e778bb:  89 ca                 mov    %ecx,%edx
> >   e778bd:  89 7d fc              mov    %edi,-0x4(%ebp)
> >   e778c0:  c1 e9 02              shr    $0x2,%ecx
> >   e778c3:  89 c7                 mov    %eax,%edi
> >   e778c5:  f3 a5                 rep movsl %ds:(%esi),%es:(%edi)
> >   e778c7:  83 e2 03              and    $0x3,%edx
> >   e778ca:  74 04                 je     e778d0 <memcpy+0x20>
> >   e778cc:  89 d1                 mov    %edx,%ecx
> >   e778ce:  f3 a4                 rep movsb %ds:(%esi),%es:(%edi)
> >   e778d0:  8b 75 f8              mov    -0x8(%ebp),%esi
> >   e778d3:  8b 7d fc              mov    -0x4(%ebp),%edi
> >   e778d6:  89 ec                 mov    %ebp,%esp
> >   e778d8:  5d                    pop    %ebp
> >   e778d9:  c3                    ret
> >
> > due to a better register allocation, avoiding the call-saved
> > %ebx register.
>
> That might well be semi-random.

Not really; the compiler has more freedom to choose a more optimal register allocation.

> > +	unsigned long ecx = n >> 2;
> > +
> > +	asm volatile("rep movsl"
> > +		     : "+D" (edi), "+S" (esi), "+c" (ecx)
> > +		     : : "memory");
> > +	ecx = n & 3;
> > +	if (ecx)
> > +		asm volatile("rep movsb"
> > +			     : "+D" (edi), "+S" (esi), "+c" (ecx)
> > +			     : : "memory");
> > +	return to;
> > +}
> >

> This version seems to generate better code still:
> see https://godbolt.org/z/78cq97PPj
>
> void *__memcpy(void *to, const void *from, unsigned long n)
> {
> 	unsigned long ecx = n >> 2;
>
> 	asm volatile("rep movsl"
> 		     : "+D" (to), "+S" (from), "+c" (ecx)
> 		     : : "memory");
> 	ecx = n & 3;
> 	if (ecx)
> 		asm volatile("rep movsb"
> 			     : "+D" (to), "+S" (from), "+c" (ecx)
> 			     : : "memory");
> 	return (char *)to - n;

I don't think the additional subtraction outweighs a move from EAX to
a temporary.

BR,
Uros.