[PATCH 7/8] x86/cpu/intel: enable X86_FEATURE_NT_GOOD on Intel Broadwellx

From: Ankur Arora
Date: Wed Oct 14 2020 - 04:34:15 EST


System: Oracle X6-2
CPU: 2 nodes * 10 cores/node * 2 threads/core
Intel Xeon E5-2630 v4 (Broadwellx, 6:79:1)
Memory: 256 GB evenly split between nodes
Microcode: 0xb00002e
scaling_governor: performance
L3 size: 25MB
intel_pstate/no_turbo: 1

Performance comparison of 'perf bench mem memset -l 1' for x86-64-stosb
(X86_FEATURE_ERMS) and x86-64-movnt (X86_FEATURE_NT_GOOD):

            x86-64-stosb (5 runs)      x86-64-movnt (5 runs)      speedup
            -----------------------    -----------------------    -------
   size       BW    (  pstdev)           BW    (  pstdev)

   16MB     17.35 GB/s ( +- 9.27%)    11.83 GB/s ( +- 0.19%)     -31.81%
  128MB      5.31 GB/s ( +- 0.13%)    11.72 GB/s ( +- 0.44%)    +121.84%
 1024MB      5.42 GB/s ( +- 0.13%)    11.78 GB/s ( +- 0.03%)    +117.34%
 4096MB      5.41 GB/s ( +- 0.41%)    11.76 GB/s ( +- 0.07%)    +117.37%

The next workload exercises the page-clearing path directly by faulting over
an anonymous mmap region backed by 1GB pages. This workload is similar to the
creation phase of pinned guests in QEMU.

$ cat pf-test.c
  #include <stdlib.h>
  #include <sys/mman.h>
  #include <linux/mman.h>

  #define HPAGE_BITS 30

  int main(int argc, char **argv)
  {
          unsigned long i;
          unsigned long len;
          unsigned long offset = 0;
          unsigned long numpages;
          char *base;

          if (argc < 2)
                  return 1;

          len = atoi(argv[1]); /* In GB */
          len *= 1UL << 30;
          numpages = len >> HPAGE_BITS;

          base = mmap(NULL, len, PROT_READ|PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS |
                      MAP_HUGETLB | MAP_HUGE_1GB, 0, 0);
          if (base == MAP_FAILED)
                  return 1;

          /* Touch each 1GB page once, write-faulting it in. */
          for (i = 0; i < numpages; i++) {
                  *((volatile char *)base + offset) = *(base + offset);
                  offset += 1UL << HPAGE_BITS;
          }

          return 0;
  }

The specific test is for a 128GB region but this is a single-threaded
O(n) workload so the exact region size is not material.
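Note that running pf-test needs 1GB hugetlb pages reserved up front. A sketch of the setup, assuming the standard hugetlb sysfs interface (on a fragmented system, reserving gigantic pages at boot via "hugepagesz=1G hugepages=N" is more reliable):

```shell
# Reserve 128 1GB hugetlb pages (as root).
echo 128 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# Verify the reservation took effect, then fault in the region:
cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
./pf-test 128
```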

Page-clearing throughput for clear_page_erms(): 3.72 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

Performance counter stats for 'bin/pf-test 128' (5 runs):

74,799,496,556 cpu-cycles # 2.176 GHz ( +- 2.22% ) (29.41%)
1,474,615,023 instructions # 0.02 insn per cycle ( +- 0.23% ) (35.29%)
2,148,580,131 cache-references # 62.502 M/sec ( +- 0.02% ) (35.29%)
71,736,985 cache-misses # 3.339 % of all cache refs ( +- 0.94% ) (35.29%)
433,713,165 branch-instructions # 12.617 M/sec ( +- 0.15% ) (35.30%)
1,008,251 branch-misses # 0.23% of all branches ( +- 1.88% ) (35.30%)
3,406,821,966 bus-cycles # 99.104 M/sec ( +- 2.22% ) (23.53%)
2,156,059,110 L1-dcache-load-misses # 445.35% of all L1-dcache accesses ( +- 0.01% ) (23.53%)
484,128,243 L1-dcache-loads # 14.083 M/sec ( +- 0.22% ) (23.53%)
944,216 LLC-loads # 0.027 M/sec ( +- 7.41% ) (23.53%)
537,989 LLC-load-misses # 56.98% of all LL-cache accesses ( +- 13.64% ) (23.53%)
2,150,138,476 LLC-stores # 62.547 M/sec ( +- 0.01% ) (11.76%)
69,598,760 LLC-store-misses # 2.025 M/sec ( +- 0.47% ) (11.76%)
483,923,875 dTLB-loads # 14.077 M/sec ( +- 0.21% ) (17.64%)
1,892 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 30.63% ) (23.53%)
4,799,154,980 dTLB-stores # 139.606 M/sec ( +- 0.03% ) (23.53%)
90 dTLB-store-misses # 0.003 K/sec ( +- 35.92% ) (23.53%)

34.377 +- 0.760 seconds time elapsed ( +- 2.21% )

Page-clearing throughput with clear_page_nt(): 11.78 GB/s
$ perf stat -r 5 --all-kernel -e ... bin/pf-test 128

Performance counter stats for 'bin/pf-test 128' (5 runs):

23,699,446,603 cpu-cycles # 2.182 GHz ( +- 0.01% ) (23.53%)
24,794,548,512 instructions # 1.05 insn per cycle ( +- 0.00% ) (29.41%)
432,775 cache-references # 0.040 M/sec ( +- 3.96% ) (29.41%)
75,580 cache-misses # 17.464 % of all cache refs ( +- 51.42% ) (29.41%)
2,492,858,290 branch-instructions # 229.475 M/sec ( +- 0.00% ) (29.42%)
34,016,826 branch-misses # 1.36% of all branches ( +- 0.04% ) (29.42%)
1,078,468,643 bus-cycles # 99.276 M/sec ( +- 0.01% ) (23.53%)
717,228 L1-dcache-load-misses # 0.20% of all L1-dcache accesses ( +- 3.77% ) (23.53%)
351,999,535 L1-dcache-loads # 32.403 M/sec ( +- 0.04% ) (23.53%)
75,988 LLC-loads # 0.007 M/sec ( +- 4.20% ) (23.53%)
24,503 LLC-load-misses # 32.25% of all LL-cache accesses ( +- 53.30% ) (23.53%)
57,283 LLC-stores # 0.005 M/sec ( +- 2.15% ) (11.76%)
19,738 LLC-store-misses # 0.002 M/sec ( +- 46.55% ) (11.76%)
351,836,498 dTLB-loads # 32.388 M/sec ( +- 0.04% ) (17.65%)
1,171 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 42.68% ) (23.53%)
17,385,579,725 dTLB-stores # 1600.392 M/sec ( +- 0.00% ) (23.53%)
200 dTLB-store-misses # 0.018 K/sec ( +- 10.63% ) (23.53%)

10.863678 +- 0.000804 seconds time elapsed ( +- 0.01% )
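As a quick cross-check (back-of-the-envelope arithmetic, not from the perf runs themselves), the quoted throughputs match the elapsed times over the 128GB region:

```python
# Throughput implied by the measured elapsed times for the 128GB region.
region_gb = 128
for fn, secs in [("clear_page_erms", 34.377), ("clear_page_nt", 10.863678)]:
    print(f"{fn}: {region_gb / secs:.2f} GB/s")
# clear_page_erms: 3.72 GB/s
# clear_page_nt: 11.78 GB/s
```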

L1-dcache-load-misses (L1D.REPLACEMENT) is substantially lower, which
suggests that, as expected, the non-temporal path avoids the
write-allocate and RFO traffic.

Note that the IPC and instruction counts etc. are quite different, but
that's just an artifact of switching from a single 'REP; STOSB' per
PAGE_SIZE region to a MOVNTI loop.
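For reference, the shape of such a MOVNTI loop can be sketched in
userspace roughly as below (a sketch only, not the kernel's
clear_page_nt() implementation; assumes an x86-64 compiler with GNU
inline asm):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Userspace sketch of a non-temporal clear loop: MOVNTI stores bypass
 * the cache hierarchy, avoiding the read-for-ownership that ordinary
 * cached stores would incur on a cold region.
 */
static void clear_region_nt(void *dst, size_t len)
{
	uint64_t *p = dst;
	size_t i;

	for (i = 0; i < len / sizeof(*p); i++)
		__asm__ volatile("movnti %1, %0"
				 : "=m" (p[i]) : "r" (0UL));
	/* Make the NT stores globally visible before later accesses. */
	__asm__ volatile("sfence" ::: "memory");
}

int main(void)
{
	size_t len = 1UL << 21;	/* 2MB test buffer */
	unsigned char *buf = aligned_alloc(4096, len);
	size_t i;

	if (!buf)
		return 1;
	for (i = 0; i < len; i++)
		buf[i] = 0xff;
	clear_region_nt(buf, len);
	for (i = 0; i < len; i++)
		if (buf[i])
			return 1;
	printf("cleared %zu bytes\n", len);
	free(buf);
	return 0;
}
```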

The page-clearing BW is substantially higher (~100% or more), so enable
X86_FEATURE_NT_GOOD for Intel Broadwellx.

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
arch/x86/kernel/cpu/intel.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 59a1e3ce3f14..161028c1dee0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -662,6 +662,8 @@ static void init_intel(struct cpuinfo_x86 *c)
c->x86_cache_alignment = c->x86_clflush_size * 2;
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
+ if (c->x86 == 6 && c->x86_model == INTEL_FAM6_BROADWELL_X)
+ set_cpu_cap(c, X86_FEATURE_NT_GOOD);
#else
/*
* Names for the Pentium II/Celeron processors
--
2.9.3