Re: [PATCH v3] x86, mem: move memmove to out of line assembler
From: Rasmus Villemoes
Date: Wed Sep 28 2022 - 03:28:21 EST
On 27/09/2022 23.02, Nick Desaulniers wrote:
> + /* Decide forward/backward copy mode */
> + cmpl dest, src
> + jb .Lbackwards_header
I know you're mostly just moving existing code, but for my own education
I'd like to understand this.
> + /*
> + * movs instruction have many startup latency
> + * so we handle small size by general register.
> + */
> + cmpl $680, n
> + jb .Ltoo_small_forwards
OK, this I get, there's some overhead, and hence we need _some_ cutoff
value; 680 is probably chosen by some trial-and-error, but the exact
value likely doesn't matter too much.
> + /*
> + * movs instruction is only good for aligned case.
> + */
> + movl src, tmp0
> + xorl dest, tmp0
> + andl $0xff, tmp0
> + jz .Lforward_movs
But this part I don't understand at all. This checks that the src and
dest have the same %256 value, which is a rather odd thing, and very
unlikely to ever be hit in practice. I could understand if it checked
that they were both 4 or 8 or 16-byte aligned (i.e., (src|dest)&FOO)),
or if it checked that they had the same offset within a cacheline [say
(src^dest)&0x3f].
Any idea where that comes from? Or am I just incapable of reading x86 asm?
> +.Ltoo_small_forwards:
> + subl $0x10, n
> +
> + /*
> + * We gobble 16 bytes forward in each loop.
> + */
> +.L16_byteswap_forwards_loop:
> + subl $0x10, n
> + movl 0*4(src), tmp0
> + movl 1*4(src), tmp1
> + movl tmp0, 0*4(dest)
> + movl tmp1, 1*4(dest)
> + movl 2*4(src), tmp0
> + movl 3*4(src), tmp1
> + movl tmp0, 2*4(dest)
> + movl tmp1, 3*4(dest)
> + leal 0x10(src), src
> + leal 0x10(dest), dest
> + jae .L16_byteswap_forwards_loop
> + addl $0x10, n
> + jmp .L16_byteswap
> +
> + /*
> + * Handle data forward by movs.
> + */
> +.p2align 4
> +.Lforward_movs:
> + movl -4(src, n), tmp0
> + leal -4(dest, n), tmp1
> + shrl $2, n
> + rep movsl
> + movl tmp0, (tmp1)
> + jmp .Ldone
So in the original code, %1 was forced to be %esi and %2 was forced to
be %edi and they were initialized by src and dest. But here I fail to
see how those registers have been properly set up before the rep movs;
your names for those are tmp0 and tmp2. You have just loaded the last
word of the source to %edi, and AFAICT %esi aka tmp2 is entirely
uninitialized at this point (the only use is in L16_byteswap).
I must be missing something. Please enlighten me.
Rasmus