Re: x86 memcpy performance

From: Maarten Lankhorst
Date: Fri Sep 09 2011 - 06:34:27 EST


Hey,

On 09/09/2011 10:14 AM, Borislav Petkov wrote:
> On Thu, Sep 08, 2011 at 12:58:13PM +0200, Maarten Lankhorst wrote:
>> I have changed your sse memcpy to test various alignments with
>> source/destination offsets instead of random, from that you can
>> see that you don't really get a speedup at all. It seems to be more
>> a case of 'kernel memcpy is significantly slower with some alignments',
>> than 'avx memcpy is just that much faster'.
>>
>> For example 3754 with src misalignment 4 and target misalignment 20
>> takes 1185 units on avx memcpy, but 1480 units with kernel memcpy
> Right, so the idea is to check whether with the bigger buffer sizes
> (and misaligned, although this should not be that often the case in
> the kernel) the SSE version would outperform a "rep movs" with ucode
> optimizations not kicking in.
>
> With your version modified back to SSE memcpy (don't have an AVX box
> right now) I get on an AMD F10h:
>
> ...
> 16384(12/40) 4756.24 7867.74 1.654192552
> 16384(40/12) 5067.81 6068.71 1.197500008
> 16384(12/44) 4341.3 8474.96 1.952172387
> 16384(44/12) 4277.13 7107.64 1.661777347
> 16384(12/48) 4989.16 7964.54 1.596369011
> 16384(48/12) 4644.94 6499.5 1.399264281
> ...
>
> which looks like pretty nice numbers to me. I can't say whether there
> ever is 16K buffer we copy in the kernel but if there were... But <16K
> buffers also show up to 1.5x speedup. So I'd say it's a uarch thing.
> As I said, best it would be to put it in the kernel and run a bunch of
> benchmarks...
I think for bigger memcpys it might make sense to demand stricter
alignment. What are your numbers for (0/0)? In my case it seems
that kernel memcpy is always faster for that. In fact, it seems
that whenever (src & 63) == (dst & 63), kernel memcpy is generally
faster.
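
As a rough userspace illustration of that condition (not the kernel
patch itself), the sketch below checks whether two pointers share the
same offset within a 64-byte block and counts large copies that don't.
The names same_line_offset, counted_memcpy and mismatched_copies are
made up for this example, and the 1024-byte threshold simply mirrors
the patch further down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE_MASK 63u	/* assumption: 64-byte blocks, as in the check above */

/* Nonzero when src and dst have the same offset within a 64-byte block,
 * i.e. (src & 63) == (dst & 63). */
static int same_line_offset(const void *dst, const void *src)
{
	return (((uintptr_t)src ^ (uintptr_t)dst) & LINE_MASK) == 0;
}

/* Hypothetical counter of large copies whose offsets differ. */
static unsigned long mismatched_copies;

/* Hypothetical wrapper: count the mismatch, then do the normal copy. */
static void *counted_memcpy(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
	if (len > 1024 && !same_line_offset(dst, src))
		mismatched_copies++;
	return memcpy(dst, src, len);
}
```

With a single buffer buf, counted_memcpy(buf + 8192 + 40, buf + 12, 2048)
bumps the counter, while the same copy to buf + 8192 + 12 does not.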

I patched my tree to WARN_ON_ONCE when this condition isn't true for
copies larger than 1024 bytes, and I get the following warnings:

WARNING: at arch/x86/kernel/head64.c:49 x86_64_start_reservations+0x3b/0x18d()
WARNING: at arch/x86/kernel/head64.c:52 x86_64_start_reservations+0xcb/0x18d()
WARNING: at arch/x86/kernel/e820.c:1077 setup_memory_map+0x3b/0x72()
WARNING: at kernel/fork.c:938 copy_process+0x148f/0x1550()
WARNING: at arch/x86/vdso/vdso32-setup.c:306 sysenter_setup+0xd4/0x301()
WARNING: at mm/util.c:72 kmemdup+0x75/0x80()
WARNING: at fs/btrfs/disk-io.c:1742 open_ctree+0x1ab5/0x1bb0()
WARNING: at fs/btrfs/disk-io.c:1744 open_ctree+0x1b35/0x1bb0()
WARNING: at fs/btrfs/extent_io.c:3634 write_extent_buffer+0x209/0x240()
WARNING: at fs/exec.c:1002 flush_old_exec+0x6c3/0x750()
WARNING: at fs/btrfs/extent_io.c:3496 read_extent_buffer+0x1b1/0x1e0()
WARNING: at kernel/module.c:2585 load_module+0x1933/0x1c30()
WARNING: at fs/btrfs/extent_io.c:3748 memcpy_extent_buffer+0x2aa/0x2f0()
WARNING: at fs/btrfs/disk-io.c:2276 write_dev_supers+0x34e/0x360()
WARNING: at lib/swiotlb.c:367 swiotlb_bounce+0xc6/0xe0()
WARNING: at fs/btrfs/transaction.c:1387 btrfs_commit_transaction+0x867/0x8a0()
WARNING: at drivers/tty/serial/serial_core.c:527 uart_write+0x14a/0x160()
WARNING: at mm/memory.c:3830 __access_remote_vm+0x251/0x270()

The most persistent offenders appear to be the btrfs *_extent_buffer
functions; they trigger the most warnings on my system. Apart from
those, there's not much to gain on my system, since the alignment is
already close to optimal.

My ext4 /home doesn't throw warnings, so I'd gain the most
by figuring out if I could improve fs/btrfs/extent_io.c in some way.
The patch for triggering those warnings is below; change WARN_ON_ONCE
to WARN_ON if you want to see which one happens the most for you.

I was pleasantly surprised though.

>> The modified testcase is attached, I did some optimizations in avx
>> memcpy, but I fear I may be missing something, when I tried to put it
>> in the kernel, it complained about sata errors I never had before,
>> so I immediately went for the power button to prevent more errors,
>> fortunately it only corrupted some kernel object files, and btrfs
>> threw checksum errors. :)
> Well, your version should do something similar to what _mmx_memcpy does:
> save FPU state and not execute in IRQ context.
>
>> All in all I think testing in userspace is safer, you might want to
>> run it on an idle cpu with schedtool, with a high fifo priority, and
>> set cpufreq governor to performance.
> No, you need a generic system with default settings - otherwise it is
> blatant benchmark lying :-)
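
For anyone who wants to reproduce the offset sweep in userspace, here
is a minimal sketch of the timing loop. It is not the attached
testcase: time_memcpy is a hypothetical helper, it only times the libc
memcpy (not the kernel or SSE/AVX variants), and the absolute numbers
will depend on the machine and its settings.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Time `iters` copies of `len` bytes with the given source/destination
 * misalignments (in bytes, each below 64). Returns the average time per
 * copy in nanoseconds, or a negative value on failure. */
static double time_memcpy(size_t len, size_t src_off, size_t dst_off, int iters)
{
	/* Volatile function pointer so the compiler keeps every call. */
	void *(*volatile copy)(void *, const void *, size_t) = memcpy;
	void *s = NULL, *d = NULL;
	struct timespec t0, t1;
	double ns = -1.0;

	assert(src_off < 64 && dst_off < 64 && iters > 0);

	/* 64-byte aligned bases, with room for the requested offsets. */
	if (posix_memalign(&s, 64, len + 64) != 0)
		return -1.0;
	if (posix_memalign(&d, 64, len + 64) != 0) {
		free(s);
		return -1.0;
	}

	unsigned char *src = (unsigned char *)s + src_off;
	unsigned char *dst = (unsigned char *)d + dst_off;
	memset(src, 0xa5, len);
	memset(dst, 0, len);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (int i = 0; i < iters; i++)
		copy(dst, src, len);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	/* Only report a time if the copy actually produced the right bytes. */
	if (memcmp(dst, src, len) == 0)
		ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
		      (t1.tv_nsec - t0.tv_nsec)) / iters;

	free(s);
	free(d);
	return ns;
}
```

Calling time_memcpy(16384, 12, 40, 100000) and time_memcpy(16384, 0, 0,
100000) gives a first impression of how much a given offset pair costs
relative to the aligned case.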

diff --git a/arch/x86/include/asm/string_64.h b/arch/x86/include/asm/string_64.h
index 19e2c46..77180bb 100644
--- a/arch/x86/include/asm/string_64.h
+++ b/arch/x86/include/asm/string_64.h
@@ -30,6 +30,14 @@ static __always_inline void *__inline_memcpy(void *to, const void *from, size_t
 #ifndef CONFIG_KMEMCHECK
 #if (__GNUC__ == 4 && __GNUC_MINOR__ >= 3) || __GNUC__ > 4
 extern void *memcpy(void *to, const void *from, size_t len);
+#define memcpy(dst, src, len) \
+({ \
+	size_t __len = (len); \
+	const void *__src = (src); \
+	void *__dst = (dst); \
+	WARN_ON_ONCE(__len > 1024 && (((long)__src & 63) != ((long)__dst & 63))); \
+	memcpy(__dst, __src, __len); \
+})
 #else
 extern void *__memcpy(void *to, const void *from, size_t len);
 #define memcpy(dst, src, len) \

