RE: [PATCH] x86_64: new and improved memset()

From: David Laight
Date: Mon Sep 16 2019 - 10:19:04 EST


From: Alexey Dobriyan
> Sent: 14 September 2019 11:34
...
> +ENTRY(memset0_rep_stosq)
> + xor eax, eax
> +.globl memsetx_rep_stosq
> +memsetx_rep_stosq:
> + lea rsi, [rdi + rcx]
> + shr rcx, 3
> + rep stosq
> + cmp rdi, rsi
> + je 1f
> +2:
> + mov [rdi], al
> + add rdi, 1
> + cmp rdi, rsi
> + jne 2b
> +1:
> + ret

You can do the 'trailing bytes' first with a potentially misaligned store.
Something like (modulo asm syntax and argument ordering):
	lea	rsi, [rdi + rcx]	# rsi = one past the end of the buffer
	shr	rcx, 3			# rcx = number of whole 8-byte words
	jrcxz	1f			# short buffer (< 8 bytes)
	mov	[rsi - 8], rax		# possibly misaligned store covers the tail
	rep stosq
	ret
1:
	cmp	rdi, rsi		# zero-length buffer?
	je	3f
2:
	mov	[rdi], al
	add	rdi, 1
	cmp	rdi, rsi
	jne	2b
3:
	ret
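
In C terms the idea is roughly this (just an illustrative sketch, not
part of the patch; the function name and arguments are made up):

	#include <stddef.h>
	#include <stdint.h>
	#include <string.h>

	/*
	 * Write the (possibly misaligned) last 8 bytes first, then let
	 * the bulk loop - the 'rep stosq' above - fill whole 8-byte
	 * words from the start.  The two regions may overlap by up to
	 * 7 bytes; both write the same value, so that is harmless.
	 */
	static void memset0_overlap_tail(void *dst, size_t len)
	{
		unsigned char *p = dst;
		uint64_t zero = 0;
		size_t i;

		if (len < 8) {
			while (len--)
				*p++ = 0;
			return;
		}

		memcpy(p + len - 8, &zero, 8);	/* tail, maybe misaligned */

		for (i = 0; i < len / 8; i++)	/* bulk 8-byte words */
			memcpy(p + i * 8, &zero, 8);
	}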

The final loop can be one instruction shorter by arranging for rdi to
hold the negative count of the bytes still to be written and basing the
store off the end pointer (rxx here standing for whichever register
that ends up in), so the 'add' sets the zero flag when the count
reaches zero:
1:
	mov	[rdi + rxx], al
	add	rdi, 1
	jnz	1b
	ret
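
For reference, the same arrangement written out in C (again only a
sketch, the names are mine): the index starts at minus the residual
count and is based off the end pointer, so the increment itself
terminates the loop and the separate 'cmp' disappears.

	#include <stddef.h>

	/* Fill the 'rem' bytes ending at 'end' with 'c'. */
	static void fill_tail(unsigned char *end, size_t rem, unsigned char c)
	{
		ptrdiff_t i;

		/* a compiler can emit this as mov/add/jnz with no cmp */
		for (i = -(ptrdiff_t)rem; i != 0; i++)
			end[i] = c;
	}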

Last I looked, 'jcxz'/'jrcxz' was 'ok' on all recent AMD and Intel CPUs.
OTOH 'loop' is horrid on Intel ones.

The same applies to the other versions.

I suspect it isn't worth optimising to realign misaligned buffers;
they are unlikely to happen often enough.

I also think that gcc's __builtin version does some of the short
buffer optimisations already.
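
For example (illustrative only; the exact code generated depends on the
compiler version and flags), with a known small constant size gcc
normally expands the call inline as a couple of wide stores instead of
calling memset() at all:

	#include <string.h>

	struct hdr {
		char pad[10];
	};

	/*
	 * gcc typically turns this into one 8-byte store plus one
	 * 2-byte store on x86-64 rather than a call to memset().
	 */
	static void clear_hdr(struct hdr *h)
	{
		__builtin_memset(h, 0, sizeof(*h));
	}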

David
