[PATCH v3 09/21] x86/asm: add clear_pages_movnt()
From: Ankur Arora
Date: Mon Jun 06 2022 - 16:46:32 EST
Add clear_pages_movnt(), which uses MOVNTI as the underlying primitive.
With this, page clearing can bypass the cache hierarchy, providing
a non-cache-polluting implementation of clear_pages().
MOVNTI, from the Intel SDM, Volume 2B, 4-101:
"The non-temporal hint is implemented by using a write combining (WC)
memory type protocol when writing the data to memory. Using this
protocol, the processor does not write the data into the cache
hierarchy, nor does it fetch the corresponding cache line from memory
into the cache hierarchy."
The AMD Architecture Programmer's Manual describes MOVNTI similarly.
One use case is zeroing large extents without bringing in cachelines
that will never be accessed. In addition, clear_pages_movnt()-based
clearing is often faster once extent sizes are O(LLC-size).
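
The O(LLC-size) observation above could be turned into a selection
policy along the following lines. This is only a sketch: the helper name
prefer_movnt_sketch() and the choice of the LLC size itself as the
cutoff are assumptions for illustration, not something this patch adds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical policy helper: prefer non-temporal clearing once the
 * extent is at least as large as the last-level cache. The cutoff is an
 * assumption; the real crossover point is workload and CPU dependent.
 */
static bool prefer_movnt_sketch(size_t extent_bytes, size_t llc_bytes)
{
	assert(llc_bytes > 0);	/* an LLC size of 0 would select movnt for everything */
	return extent_bytes >= llc_bytes;
}
```

With a 48MB LLC, such a policy would keep a 16MB extent on the cached
path and send a 128MB extent to clear_pages_movnt().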
Because the WC protocol is weakly ordered, MOVNTI stores are weakly
ordered with respect to other instructions operating on the memory
hierarchy. The caller must handle this by executing an SFENCE once
clearing is done.
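
The caller-side SFENCE requirement can be illustrated with a userspace C
sketch. This is not the kernel implementation: it assumes an x86-64
compiler that provides <emmintrin.h> and 4 KiB pages, and the names
clear_pages_movnt_sketch(), zero_extent_sketch() and SKETCH_PAGE_SIZE
are hypothetical. _mm_stream_si64() is the intrinsic that corresponds to
MOVNTI, and _mm_sfence() the one that corresponds to SFENCE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>	/* _mm_stream_si64(), _mm_sfence() */

#define SKETCH_PAGE_SIZE 4096UL	/* assumption: 4 KiB pages */

/* Zero npages pages with non-temporal stores. No fence is issued here. */
static void clear_pages_movnt_sketch(void *page, unsigned long npages)
{
	long long *p = page;
	size_t bytes = npages * SKETCH_PAGE_SIZE;

	/* Keep every quadword store naturally aligned. */
	assert(((uintptr_t)page & 7) == 0);

	for (size_t off = 0; off < bytes; off += 64) {
		/* Eight 8-byte stores cover one 64-byte cacheline. */
		for (int i = 0; i < 8; i++)
			_mm_stream_si64(&p[off / 8 + i], 0);
	}
}

/* Caller side: fence once, after all non-temporal stores are issued. */
static void zero_extent_sketch(void *page, unsigned long npages)
{
	clear_pages_movnt_sketch(page, npages);
	_mm_sfence();
}
```

Keeping the fence in the caller lets several clearing calls share a
single SFENCE, which matches the division of responsibility described
above.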
The implementation is straightforward: the inner loop is unrolled to
keep the code similar to memset_movnti(), so that clear_pages_movnt()
performance can be gauged via 'perf bench mem memset'.
# Intel Icelakex
# Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
# (X86_FEATURE_ERMS) and x86-64-movnt:
System: Oracle X9-2 (2 nodes * 32 cores * 2 threads)
Processor: Intel Xeon(R) Platinum 8358 CPU @ 2.60GHz (Icelakex, 6:106:6)
Memory: 512 GB evenly split between nodes
LLC-size: 48MB for each node (32-cores * 2-threads)
no_turbo: 1, Microcode: 0xd0001e0, scaling-governor: performance
              x86-64-stosb (5 runs)     x86-64-movnt (5 runs)    Delta(%)
              ----------------------    ---------------------    --------
     size            BW   (   stdev)          BW   (   stdev)

      2MB      14.37 GB/s ( +- 1.55)    12.59 GB/s ( +- 1.20)     -12.38%
     16MB      16.93 GB/s ( +- 2.61)    15.91 GB/s ( +- 2.74)      -6.02%
    128MB      12.12 GB/s ( +- 1.06)    22.33 GB/s ( +- 1.84)     +84.24%
   1024MB      12.12 GB/s ( +- 0.02)    23.92 GB/s ( +- 0.14)     +97.35%
   4096MB      12.08 GB/s ( +- 0.02)    23.98 GB/s ( +- 0.18)     +98.50%
Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
arch/x86/include/asm/page_64.h | 1 +
arch/x86/lib/clear_page_64.S | 21 +++++++++++++++++++++
2 files changed, 22 insertions(+)
diff --git a/arch/x86/include/asm/page_64.h b/arch/x86/include/asm/page_64.h
index a88a3508888a..3affc4ecb8da 100644
--- a/arch/x86/include/asm/page_64.h
+++ b/arch/x86/include/asm/page_64.h
@@ -55,6 +55,7 @@ extern unsigned long __phys_addr_symbol(unsigned long);
void clear_pages_orig(void *page, unsigned long npages);
void clear_pages_rep(void *page, unsigned long npages);
void clear_pages_erms(void *page, unsigned long npages);
+void clear_pages_movnt(void *page, unsigned long npages);
#define __HAVE_ARCH_CLEAR_USER_PAGES
static inline void clear_pages(void *page, unsigned int npages)
diff --git a/arch/x86/lib/clear_page_64.S b/arch/x86/lib/clear_page_64.S
index 2cc3b681734a..83d14f1c9f57 100644
--- a/arch/x86/lib/clear_page_64.S
+++ b/arch/x86/lib/clear_page_64.S
@@ -58,3 +58,24 @@ SYM_FUNC_START(clear_pages_erms)
RET
SYM_FUNC_END(clear_pages_erms)
EXPORT_SYMBOL_GPL(clear_pages_erms)
+
+SYM_FUNC_START(clear_pages_movnt)
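+ xorl %eax,%eax /* zero %rax; a 32-bit write clears the upper half */
+ movq %rsi,%rcx
+ shlq $PAGE_SHIFT, %rcx /* %rcx = npages << PAGE_SHIFT = bytes to clear */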
+
+ .p2align 4
+.Lstart:
+ movnti %rax, 0x00(%rdi)
+ movnti %rax, 0x08(%rdi)
+ movnti %rax, 0x10(%rdi)
+ movnti %rax, 0x18(%rdi)
+ movnti %rax, 0x20(%rdi)
+ movnti %rax, 0x28(%rdi)
+ movnti %rax, 0x30(%rdi)
+ movnti %rax, 0x38(%rdi)
+ addq $0x40, %rdi
+ subq $0x40, %rcx
+ ja .Lstart
+ RET
+SYM_FUNC_END(clear_pages_movnt)
--
2.31.1