Re: x86 memcpy performance

From: Maarten Lankhorst
Date: Thu Sep 01 2011 - 11:15:30 EST


Hey,

2011/8/16 Borislav Petkov <bp@xxxxxxxxx>:
> On Mon, Aug 15, 2011 at 10:34:35PM -0400, Valdis.Kletnieks@xxxxxx wrote:
>> On Sun, 14 Aug 2011 11:59:10 +0200, Borislav Petkov said:
>>
>> > Benchmarking with 10000 iterations, average results:
>> > size XM MM speedup
>> > 119 540.58 449.491 0.8314969419
>>
>> > 12273 2307.86 4042.88 1.751787902
>> > 13924 2431.8 4224.48 1.737184756
>> > 14335 2469.4 4218.82 1.708440514
>> > 15018 2675.67 1904.07 0.711622886
>> > 16374 2989.75 5296.26 1.771470902
>> > 24564 4262.15 7696.86 1.805863077
>> > 27852 4362.53 3347.72 0.7673805572
>> > 28672 5122.8 7113.14 1.388524413
>> > 30033 4874.62 8740.04 1.792967931
>>
>> The numbers for 15018 and 27852 are *way* odd for the MM case. I don't feel
>> really good about this till we understand what happened for those two cases.
>
> Yep.
>
>> Also, anytime I see "10000 iterations", I ask myself if the benchmark
>> rigging took proper note of hot/cold cache issues. That *may* explain
>> the two oddball results we see above - but not knowing more about how
>> it was benched, it's hard to say.
>
> Yeah, the more scrutiny this gets the better. So I've cleaned up my
> setup and have attached it.
>
> xm_mem.c does the benchmarking and in bench_memcpy() there's the
> sse_memcpy call which is the SSE memcpy implementation using inline asm.
> It looks like gcc produces pretty crappy code here because if I replace
> the sse_memcpy call with xm_memcpy() from xm_memcpy.S - this is the
> same function but in pure asm - I get much better numbers, sometimes
> even over 2x. It all depends on the alignment of the buffers though.
> Also, those numbers don't include the context saving/restoring which the
> kernel does for us.
>
> 7491 1509.89 2346.94 1.554378381
> 8170 2166.81 2857.78 1.318890326
> 12277 2659.03 4179.31 1.571744176
> 13907 2571.24 4125.7 1.604558427
> 14319 2638.74 5799.67 2.19789466 <----
> 14993 2752.42 4413.85 1.603625603
> 16371 3479.11 5562.65 1.59887055

This work intrigued me: in some cases the kernel memcpy was a lot faster than the
SSE memcpy, and I finally figured out why. I also extended the test with an
optimized AVX memcpy, but I think the kernel memcpy will always win in the aligned case.

Those numbers you posted don't seem right, though. A lot depends on the alignment:
for example, when source and destination have the same alignment relative to a
64-byte boundary, the kernel memcpy beats the AVX memcpy on my machine.

I replaced the malloc calls with memalign(65536, size + 256) so I could play
around with the alignments a little. That explains why, for some sizes, the kernel
memcpy was faster than the SSE memcpy in your test results:
when ((src & 63) == (dst & 63)), the kernel memcpy seems to always win; otherwise
the AVX memcpy might.

If you want to speed up memcpy, I think your best bet is to find out why it's
so much slower when src and dst aren't 64-byte aligned relative to each other.

Cheers,
Maarten

---
Attached: my modified version of the SSE memcpy you posted.

I changed it a bit and used AVX, but some of the other changes might
benefit your SSE memcpy too.
/*
 * ym_memcpy - AVX version of memcpy
 *
 * Input:
 *	rdi destination
 *	rsi source
 *	rdx count
 *
 * Output:
 *	rax original destination
 */
.globl ym_memcpy
.type ym_memcpy, @function

ym_memcpy:
mov %rdi, %rax

/*
 * Align the destination to 32 bytes: copy the leading (-dst & 31)
 * bytes with rep movsb (assumes count covers them).
 */
movzbq %dil, %rcx
negb %cl
andb $0x1f, %cl
subq %rcx, %rdx
rep movsb

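/* rcx = number of full 512-byte blocks, rdx = leftover byte count */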
movq %rdx, %rcx
andq $0x1ff, %rdx
shrq $9, %rcx
jz .trailer

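/* is the source also 32-byte aligned? */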
movb %sil, %r8b
andb $0x1f, %r8b
test %r8b, %r8b
jz .repeat_a

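/* unaligned source: unaligned loads, aligned stores */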
.align 32
.repeat_ua:
vmovups 0x0(%rsi), %ymm0
vmovups 0x20(%rsi), %ymm1
vmovups 0x40(%rsi), %ymm2
vmovups 0x60(%rsi), %ymm3
vmovups 0x80(%rsi), %ymm4
vmovups 0xa0(%rsi), %ymm5
vmovups 0xc0(%rsi), %ymm6
vmovups 0xe0(%rsi), %ymm7
vmovups 0x100(%rsi), %ymm8
vmovups 0x120(%rsi), %ymm9
vmovups 0x140(%rsi), %ymm10
vmovups 0x160(%rsi), %ymm11
vmovups 0x180(%rsi), %ymm12
vmovups 0x1a0(%rsi), %ymm13
vmovups 0x1c0(%rsi), %ymm14
vmovups 0x1e0(%rsi), %ymm15

vmovaps %ymm0, 0x0(%rdi)
vmovaps %ymm1, 0x20(%rdi)
vmovaps %ymm2, 0x40(%rdi)
vmovaps %ymm3, 0x60(%rdi)
vmovaps %ymm4, 0x80(%rdi)
vmovaps %ymm5, 0xa0(%rdi)
vmovaps %ymm6, 0xc0(%rdi)
vmovaps %ymm7, 0xe0(%rdi)
vmovaps %ymm8, 0x100(%rdi)
vmovaps %ymm9, 0x120(%rdi)
vmovaps %ymm10, 0x140(%rdi)
vmovaps %ymm11, 0x160(%rdi)
vmovaps %ymm12, 0x180(%rdi)
vmovaps %ymm13, 0x1a0(%rdi)
vmovaps %ymm14, 0x1c0(%rdi)
vmovaps %ymm15, 0x1e0(%rdi)

/* advance pointers */
addq $0x200, %rsi
addq $0x200, %rdi
subq $1, %rcx
jnz .repeat_ua
jmp .trailer

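/* source and destination both 32-byte aligned */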
.align 32
.repeat_a:
prefetchnta 0x80(%rsi)
prefetchnta 0x100(%rsi)
prefetchnta 0x180(%rsi)
vmovaps 0x0(%rsi), %ymm0
vmovaps 0x20(%rsi), %ymm1
vmovaps 0x40(%rsi), %ymm2
vmovaps 0x60(%rsi), %ymm3
vmovaps 0x80(%rsi), %ymm4
vmovaps 0xa0(%rsi), %ymm5
vmovaps 0xc0(%rsi), %ymm6
vmovaps 0xe0(%rsi), %ymm7
vmovaps 0x100(%rsi), %ymm8
vmovaps 0x120(%rsi), %ymm9
vmovaps 0x140(%rsi), %ymm10
vmovaps 0x160(%rsi), %ymm11
vmovaps 0x180(%rsi), %ymm12
vmovaps 0x1a0(%rsi), %ymm13
vmovaps 0x1c0(%rsi), %ymm14
vmovaps 0x1e0(%rsi), %ymm15

vmovaps %ymm0, 0x0(%rdi)
vmovaps %ymm1, 0x20(%rdi)
vmovaps %ymm2, 0x40(%rdi)
vmovaps %ymm3, 0x60(%rdi)
vmovaps %ymm4, 0x80(%rdi)
vmovaps %ymm5, 0xa0(%rdi)
vmovaps %ymm6, 0xc0(%rdi)
vmovaps %ymm7, 0xe0(%rdi)
vmovaps %ymm8, 0x100(%rdi)
vmovaps %ymm9, 0x120(%rdi)
vmovaps %ymm10, 0x140(%rdi)
vmovaps %ymm11, 0x160(%rdi)
vmovaps %ymm12, 0x180(%rdi)
vmovaps %ymm13, 0x1a0(%rdi)
vmovaps %ymm14, 0x1c0(%rdi)
vmovaps %ymm15, 0x1e0(%rdi)

/* advance pointers */
addq $0x200, %rsi
addq $0x200, %rdi
subq $1, %rcx
jnz .repeat_a

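/* copy the remaining (< 512) bytes: qwords first, then bytes */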
.align 32
.trailer:
movq %rdx, %rcx
shrq $3, %rcx
rep; movsq
movq %rdx, %rcx
andq $0x7, %rcx
rep; movsb
retq
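
For reference, this is roughly how I call it from the userspace benchmark; the
prototype below is just what the register comment above implies (rdi = destination,
rsi = source, rdx = count, original destination returned in rax). In the kernel you
would of course still need the FPU context save/restore mentioned earlier.

#include <malloc.h>
#include <stdio.h>
#include <string.h>

void *ym_memcpy(void *dst, const void *src, size_t n);	/* from ym_memcpy.S */

int main(void)
{
	size_t n = 30033;
	unsigned char *src = memalign(65536, n + 256);
	unsigned char *dst = memalign(65536, n + 256);

	memset(src, 0xaa, n + 256);
	/* src + 5 forces the unaligned-source loop, dst stays 32-byte aligned */
	ym_memcpy(dst, src + 5, n);
	printf("mismatch: %d\n", memcmp(dst, src + 5, n));

	free(src);
	free(dst);
	return 0;
}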