RE: [PATCH] x86_64: new and improved memset()
From: David Laight
Date: Mon Sep 16 2019 - 10:19:04 EST
From: Alexey Dobriyan
> Sent: 14 September 2019 11:34
...
> +ENTRY(memset0_rep_stosq)
> + xor eax, eax
> +.globl memsetx_rep_stosq
> +memsetx_rep_stosq:
> + lea rsi, [rdi + rcx]
> + shr rcx, 3
> + rep stosq
> + cmp rdi, rsi
> + je 1f
> +2:
> + mov [rdi], al
> + add rdi, 1
> + cmp rdi, rsi
> + jne 2b
> +1:
> + ret
You can do the 'trailing bytes' first with a potentially misaligned store.
Something like (modulo asm syntax and argument ordering):
lea rsi, [rdi + rcx]
shr rcx, 3
jcxz 1f # Short buffer (fewer than 8 bytes)
mov [rsi - 8], rax # Last 8 bytes first; may be unaligned and may overlap the stosq area
rep stosq
ret
1: # Byte loop for the short case; assumes a non-zero length
mov [rdi], al
add rdi, 1
cmp rdi, rsi
jne 1b
ret
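In C the same trick looks roughly like the sketch below (memset0_sketch and the
memcpy-as-unaligned-store idiom are purely for illustration, not a proposed
implementation):

#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void memset0_sketch(void *dst, size_t len)
{
    unsigned char *p = dst;
    uint64_t zero = 0;
    size_t i;

    if (len < 8) {
        /* short buffer: plain byte loop (the jcxz path above) */
        for (i = 0; i < len; i++)
            p[i] = 0;
        return;
    }

    /* store the last 8 bytes first; may be unaligned and may
     * overlap the qwords written below, which is harmless */
    memcpy(p + len - 8, &zero, sizeof(zero));

    /* stands in for 'rep stosq': len / 8 qwords starting at p */
    for (i = 0; i < len / 8; i++)
        memcpy(p + 8 * i, &zero, sizeof(zero));
}

The byte loop then only ever runs for lengths below 8; everything longer is
covered by one possibly-unaligned store plus the qword stores.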
The final loop can be one instruction shorter by keeping a negative byte count
in rdi and the end pointer in a spare register (rxx here), so the 'add' supplies
the flag test and the cmp goes away:
1:
mov [rdi + rxx], al
add rdi, 1
jnz 1b
ret
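In C terms that loop is roughly the following (end/fill/count are made-up
names; the increment reaching zero replaces the separate compare that the
asm's 'add'/'jnz' pair expresses):

static void memset_tail(unsigned char *end, unsigned char fill, long count)
{
    long i = -count;    /* negative index, counts up towards zero */

    while (i) {
        end[i] = fill;  /* end + i walks the last 'count' bytes */
        i++;            /* hitting zero ends the loop, no separate cmp */
    }
}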
Last I looked, 'jcxz' was 'ok' on all recent AMD and Intel CPUs.
OTOH 'loop' is horrid on the Intel ones.
The same applies to the other versions.
I suspect it isn't worth optimising to realign misaligned buffers;
they are unlikely to happen often enough.
I also think that gcc's __builtin version does some of the short
buffer optimisations already.
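E.g. something like this (pkt_hdr and clear_hdr are made-up names) normally
gets expanded to a handful of direct stores rather than a call, at least at
-O2 with builtins enabled; the exact code depends on options and target:

struct pkt_hdr {
    long a, b, c;
};

static void clear_hdr(struct pkt_hdr *h)
{
    /* constant 24-byte size: gcc typically emits a few direct
     * stores here instead of calling out to memset() */
    __builtin_memset(h, 0, sizeof(*h));
}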
David