Re: [PATCH] memcpy_flushcache: use cache flusing for larger lengths

From: Andy Lutomirski
Date: Tue Apr 07 2020 - 12:09:37 EST



> On Apr 7, 2020, at 8:01 AM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
> ï[ resending this to x86 maintainers ]
>
> Hi
>
> I tested performance of various methods how to write to optane-based
> persistent memory, and found out that non-temporal stores achieve
> throughput 1.3 GB/s. 8 cached stores immediatelly followed by clflushopt
> or clwb achieve throughput 1.6 GB/s.
>
> memcpy_flushcache uses non-temporal stores, I modified it to use cached
> stores + clflushopt and it improved performance of the dm-writecache
> target significantly:
>
> dm-writecache throughput:
> (dd if=/dev/zero of=/dev/mapper/wc bs=64k oflag=direct)
> writecache block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> For block size 512, movnti works better, for larger block sizes,
> clflushopt is better.
>
> I was also testing the novafs filesystem, it is not upstream, but it
> benefitted from similar change in __memcpy_flushcache and
> __copy_user_nocache:
> write throughput on big files - movnti: 662 MB/s, clwb: 1323 MB/s
> write throughput on small files - movnti: 621 MB/s, clwb: 1013 MB/s
>
>
> I submit this patch for __memcpy_flushcache that improves dm-writecache
> performance.
>
> Other ideas - should we introduce memcpy_to_pmem instead of modifying
> memcpy_flushcache and move this logic there? Or should I modify the
> dm-writecache target directly to use clflushopt with no change to the
> architecture-specific code?
>
> Mikulas
>
>
>
>
> From: Mikulas Patocka <mpatocka@xxxxxxxxxx>
>
> I tested dm-writecache performance on a machine with Optane nvdimm and it
> turned out that for larger writes, cached stores + cache flushing perform
> better than non-temporal stores. This is the throughput of dm-writecache
> measured with this command:
> dd if=/dev/zero of=/dev/mapper/wc bs=64 oflag=direct
>
> block size 512 1024 2048 4096
> movnti 496 MB/s 642 MB/s 725 MB/s 744 MB/s
> clflushopt 373 MB/s 688 MB/s 1.1 GB/s 1.2 GB/s
>
> We can see that for smaller block, movnti performs better, but for larger
> blocks, clflushopt has better performance.
>
> This patch changes the function __memcpy_flushcache accordingly, so that
> with size >= 768 it performs cached stores and cache flushing. Note that
> we must not use the new branch if the CPU doesn't have clflushopt - in
> that case, the kernel would use inefficient "clflush" instruction that has
> very bad performance.
>
> Signed-off-by: Mikulas Patocka <mpatocka@xxxxxxxxxx>
>
> ---
> arch/x86/lib/usercopy_64.c | 36 ++++++++++++++++++++++++++++++++++++
> 1 file changed, 36 insertions(+)
>
> Index: linux-2.6/arch/x86/lib/usercopy_64.c
> ===================================================================
> --- linux-2.6.orig/arch/x86/lib/usercopy_64.c 2020-03-24 15:15:36.644945091 -0400
> +++ linux-2.6/arch/x86/lib/usercopy_64.c 2020-03-30 07:17:51.450290007 -0400
> @@ -152,6 +152,42 @@ void __memcpy_flushcache(void *_dst, con
> return;
> }
>
> + if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) && size >= 768 && likely(boot_cpu_data.x86_clflush_size == 64)) {
> + while (!IS_ALIGNED(dest, 64)) {
> + asm("movq (%0), %%r8\n"
> + "movnti %%r8, (%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8");
> + dest += 8;
> + source += 8;
> + size -= 8;
> + }
> + do {
> + asm("movq (%0), %%r8\n"
> + "movq 8(%0), %%r9\n"
> + "movq 16(%0), %%r10\n"
> + "movq 24(%0), %%r11\n"
> + "movq %%r8, (%1)\n"
> + "movq %%r9, 8(%1)\n"
> + "movq %%r10, 16(%1)\n"
> + "movq %%r11, 24(%1)\n"
> + "movq 32(%0), %%r8\n"
> + "movq 40(%0), %%r9\n"
> + "movq 48(%0), %%r10\n"
> + "movq 56(%0), %%r11\n"
> + "movq %%r8, 32(%1)\n"
> + "movq %%r9, 40(%1)\n"
> + "movq %%r10, 48(%1)\n"
> + "movq %%r11, 56(%1)\n"
> + :: "r" (source), "r" (dest)
> + : "memory", "r8", "r9", "r10", "r11");

Does this actually work better than the corresponding C code?

Also, that memory clobber probably isnât doing your code generation any favors. Experimentally, you have the constraints wrong. An ârâ constraint doesnât tell GCC that you are dereferencing the pointer. You need to use âmâ with a correctly-sized type. But I bet plain C is at least as good.

> + clflushopt((void *)dest);
> + dest += 64;
> + source += 64;
> + size -= 64;
> + } while (size >= 64);
> + }
> +
> /* 4x8 movnti loop */
> while (size >= 32) {
> asm("movq (%0), %%r8\n"
>