[PATCH 6/9] mm/clear_huge_page: use multi-page clearing

From: Ankur Arora
Date: Mon Apr 03 2023 - 01:23:54 EST


clear_pages_rep() and clear_pages_erms() use string instructions
internally. Unlike a MOV loop, these let us explicitly advertise the
region size to the processor. Clearing in multi-page chunks therefore
lets us specify the real region size (or something close to it), which
is good for two reasons:

- The region size can serve as a hint to current (some AMD Zen models)
  and possibly future uarchs, which can use it to avoid polluting one
  or more levels of the dcache.

- String instructions are typically microcoded, so they are cheaper
  when amortized across larger regions. We also execute fewer loop
  iterations (e.g. one cond_resched() check per page), though those
  instructions are likely free.
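
As a rough illustration of per-page versus multi-page clearing, here is
a minimal user-space sketch. It is not the kernel's clear_pages_rep() or
clear_pages_erms(); clear_region(), clear_page_at_a_time(),
clear_multi_page() and SKETCH_PAGE_SIZE are hypothetical names, and a
4KB base page is assumed. On x86-64 with GCC-style inline asm it issues
REP STOSB, elsewhere it falls back to memset():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed base page size */

/*
 * Zero len bytes at dst. On x86-64 this is a single REP STOSB, so the
 * full length is visible to the processor (in RCX) up front. Other
 * targets fall back to memset().
 */
static void clear_region(void *dst, size_t len)
{
	assert(dst != NULL);
#if defined(__x86_64__) && defined(__GNUC__)
	__asm__ __volatile__("rep stosb"
			     : "+D" (dst), "+c" (len)
			     : "a" (0)
			     : "memory");
#else
	memset(dst, 0, len);
#endif
}

/* Per-page clearing: each call only advertises one page. */
static void clear_page_at_a_time(void *dst, size_t npages)
{
	char *p = dst;
	size_t i;

	for (i = 0; i < npages; i++)
		clear_region(p + i * SKETCH_PAGE_SIZE, SKETCH_PAGE_SIZE);
}

/* Multi-page clearing: one call advertises the whole extent. */
static void clear_multi_page(void *dst, size_t npages)
{
	clear_region(dst, npages * SKETCH_PAGE_SIZE);
}
```

Both helpers leave the buffer in the same state; the only difference is
the length the string instruction is given per invocation.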

clear_huge_page() now clears in three sections: the local neighbourhood
of the faulting address (the faulting page and two pages on either
side of it), and the regions to its left and right.

The local neighbourhood is cleared last to keep its cachelines hot.
Huge pages larger than MAX_ORDER_NR_PAGES are instead cleared in a
single call.
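
The three-way split can be sketched in user space as below. This
mirrors the index arithmetic in the patch, but struct pg_range,
split_regions() and range_len() are hypothetical helpers introduced
only for illustration:

```c
#include <assert.h>

/* Hypothetical helper type: an inclusive range of page indices. */
struct pg_range {
	int start;
	int end;	/* end < start means the range is empty */
};

/*
 * Split a huge page of nr_pages base pages around the faulting page
 * index pgidx. out[0] is the region to the left of the fault, out[1]
 * the region to the right, and out[2] the neighbourhood of the fault,
 * which would be cleared last.
 */
static void split_regions(int pgidx, int nr_pages, int width,
			  struct pg_range out[3])
{
	const int first_pg = 0, last_pg = nr_pages - 1;

	assert(nr_pages > 0 && pgidx >= first_pg && pgidx <= last_pg);

	/* Clamp the neighbourhood to the bounds of the huge page. */
	out[2].start = (pgidx - width) < first_pg ? first_pg : pgidx - width;
	out[2].end = (pgidx + width) > last_pg ? last_pg : pgidx + width;

	out[0].start = first_pg;
	out[0].end = out[2].start - 1;

	out[1].start = out[2].end + 1;
	out[1].end = last_pg;
}

/* Number of pages in a range; zero if the range is empty. */
static int range_len(struct pg_range r)
{
	return r.end >= r.start ? r.end - r.start + 1 : 0;
}
```

For a fault at page index 100 of a 512-page huge page with width 2,
this gives a left region of [0, 97], a right region of [103, 511] and
a neighbourhood of [98, 102]. A fault at index 0 leaves the left region
empty.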

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 64GB region (on the local
NUMA node):

Icelakex (Platinum 8358, ucode=0xd0002c1, no_turbo=1):

                 mm/clear_huge_page   x86/clear_huge_page   change
                       (GB/s)               (GB/s)

    pg-sz=2MB           8.76                11.82           +34.93%
    pg-sz=1GB           8.99                12.18           +35.48%
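
The change column is the relative gain of the patched bandwidth over
the baseline. A small sketch of that arithmetic, where pct_change() is
a hypothetical helper and not part of the benchmark:

```c
#include <assert.h>

/* Relative change from base to patched, in percent. */
static double pct_change(double base, double patched)
{
	assert(base > 0.0);
	return (patched / base - 1.0) * 100.0;
}
```

For example, pct_change(8.76, 11.82) reproduces the pg-sz=2MB row to
within rounding.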

On Icelakex we continue to allocate cachelines; L1-dcache-load-misses
is essentially unchanged:

pg-sz=2MB:
- 701,951,397 L1-dcache-loads # 47.985 M/sec ( +- 19.22% ) (69.23%)
- 3,239,403,770 L1-dcache-load-misses # 691.17% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
+ 194,318,641 L1-dcache-loads # 17.905 M/sec ( +- 19.07% ) (69.25%)
+ 3,238,878,229 L1-dcache-load-misses # 2480.93% of all L1-dcache accesses ( +- 19.25% ) (69.26%)

pg-sz=1GB:
- 532,232,051 L1-dcache-loads # 37.378 M/sec ( +- 19.25% ) (69.23%)
- 3,224,574,249 L1-dcache-load-misses # 909.02% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
+ 22,587,703 L1-dcache-loads # 2.150 M/sec ( +- 19.38% ) (69.25%)
+ 3,223,143,697 L1-dcache-load-misses # 21478.37% of all L1-dcache accesses ( +- 19.25% ) (69.25%)


Milan (EPYC 7J13, ucode=0xa0011a9, boost=0):

                 mm/clear_huge_page   x86/clear_huge_page   change
                       (GB/s)               (GB/s)

    pg-sz=2MB          12.24                17.54           +43.30%
    pg-sz=1GB          17.98                37.24          +107.11%

Milan uses a threshold of ~32MB for eliding cacheline allocation, so we
see a dropoff in cacheline allocations only for pg-sz=1GB:

pg-sz=2MB:
- 2,495,566,569 L1-dcache-loads # 476.417 M/sec ( +- 0.04% ) (33.38%)
- 1,079,711,798 L1-dcache-load-misses # 43.28% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
+ 2,235,310,058 L1-dcache-loads # 610.770 M/sec ( +- 0.02% ) (33.37%)
+ 1,089,602,355 L1-dcache-load-misses # 48.73% of all L1-dcache accesses ( +- 0.01% ) (33.37%)

pg-sz=1GB:
- 2,417,846,489 L1-dcache-loads # 679.753 M/sec ( +- 0.01% ) (33.38%)
- 1,075,531,869 L1-dcache-load-misses # 44.49% of all L1-dcache accesses ( +- 0.01% ) (33.35%)
+ 31,159,378 L1-dcache-loads # 18.119 M/sec ( +- 3.27% ) (33.46%)
+ 14,692,358 L1-dcache-load-misses # 48.21% of all L1-dcache accesses ( +- 3.12% ) (33.46%)

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---

Fuller perf stats for context:

# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

21,945.59 msec task-clock # 2.999 CPUs utilized ( +- 19.25% )
34 context-switches # 2.324 /sec ( +- 20.38% )
3 cpu-migrations # 0.205 /sec ( +- 19.25% )
198,152 page-faults # 13.546 K/sec ( +- 19.29% )
56,513,364,885 cycles # 3.863 GHz ( +- 19.25% ) (38.44%)
2,583,719,806 instructions # 0.07 insn per cycle ( +- 19.24% ) (46.14%)
585,212,952 branches # 40.005 M/sec ( +- 19.23% ) (53.83%)
562,164 branch-misses # 0.14% of all branches ( +- 19.23% ) (61.53%)
282,621,312,162 slots # 19.320 G/sec ( +- 19.25% ) (69.22%)
11,048,627,225 topdown-retiring # 3.8% Retiring ( +- 19.22% ) (69.22%)
34,358,400,894 topdown-bad-spec # 11.5% Bad Speculation ( +- 19.57% ) (69.22%)
2,231,092,499 topdown-fe-bound # 0.8% Frontend Bound ( +- 19.25% ) (69.22%)
246,679,210,776 topdown-be-bound # 84.0% Backend Bound ( +- 19.21% ) (69.22%)
701,951,397 L1-dcache-loads # 47.985 M/sec ( +- 19.22% ) (69.23%)
3,239,403,770 L1-dcache-load-misses # 691.17% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
11,475,685 LLC-loads # 784.475 K/sec ( +- 19.23% ) (69.25%)
793,272 LLC-load-misses # 10.36% of all LL-cache accesses ( +- 19.23% ) (69.25%)
17,821,045 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.51% ) (30.77%)
693,339,354 dTLB-loads # 47.397 M/sec ( +- 19.33% ) (30.76%)
637,811 dTLB-load-misses # 0.14% of all dTLB cache accesses ( +- 19.09% ) (30.75%)
131,922 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.59% ) (30.75%)

7.31681 +- 0.00177 seconds time elapsed ( +- 0.02% )


# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

16,276.28 msec task-clock # 2.999 CPUs utilized ( +- 19.24% )
27 context-switches # 2.488 /sec ( +- 19.25% )
3 cpu-migrations # 0.276 /sec ( +- 19.25% )
196,935 page-faults # 18.146 K/sec ( +- 19.25% )
41,906,597,608 cycles # 3.861 GHz ( +- 19.24% ) (38.44%)
729,479,932 instructions # 0.03 insn per cycle ( +- 19.38% ) (46.14%)
133,969,095 branches # 12.344 M/sec ( +- 19.35% ) (53.84%)
412,818 branch-misses # 0.46% of all branches ( +- 18.97% ) (61.54%)
209,574,316,961 slots # 19.311 G/sec ( +- 19.24% ) (69.24%)
4,933,512,982 topdown-retiring # 2.3% Retiring ( +- 19.24% ) (69.24%)
20,272,641,267 topdown-bad-spec # 9.4% Bad Speculation ( +- 19.51% ) (69.24%)
837,421,487 topdown-fe-bound # 0.4% Frontend Bound ( +- 19.24% ) (69.24%)
190,089,232,476 topdown-be-bound # 88.0% Backend Bound ( +- 19.19% ) (69.24%)
194,318,641 L1-dcache-loads # 17.905 M/sec ( +- 19.07% ) (69.25%)
3,238,878,229 L1-dcache-load-misses # 2480.93% of all L1-dcache accesses ( +- 19.25% ) (69.26%)
10,560,508 LLC-loads # 973.081 K/sec ( +- 19.23% ) (69.26%)
724,884 LLC-load-misses # 10.28% of all LL-cache accesses ( +- 17.15% ) (69.26%)
14,378,070 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.13% ) (30.75%)
185,562,230 dTLB-loads # 17.098 M/sec ( +- 19.74% ) (30.74%)
617,978 dTLB-load-misses # 0.51% of all dTLB cache accesses ( +- 18.72% ) (30.74%)
112,509 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.76% ) (30.74%)

5.42697 +- 0.00152 seconds time elapsed ( +- 0.03% )


# Icelakex, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):

21,361.22 msec task-clock # 2.999 CPUs utilized ( +- 19.25% )
23 context-switches # 1.615 /sec ( +- 18.95% )
3 cpu-migrations # 0.211 /sec ( +- 19.25% )
701 page-faults # 49.230 /sec ( +- 19.27% )
54,981,958,487 cycles # 3.861 GHz ( +- 19.25% ) (38.44%)
2,012,625,953 instructions # 0.05 insn per cycle ( +- 19.25% ) (46.14%)
470,264,509 branches # 33.026 M/sec ( +- 19.25% ) (53.83%)
194,801 branch-misses # 0.06% of all branches ( +- 18.88% ) (61.53%)
274,966,507,627 slots # 19.311 G/sec ( +- 19.25% ) (69.22%)
10,555,137,650 topdown-retiring # 3.8% Retiring ( +- 19.04% ) (69.22%)
21,206,785,918 topdown-bad-spec # 7.8% Bad Speculation ( +- 18.13% ) (69.22%)
1,094,597,329 topdown-fe-bound # 0.4% Frontend Bound ( +- 19.25% ) (69.22%)
244,462,123,545 topdown-be-bound # 88.0% Backend Bound ( +- 19.33% ) (69.22%)
532,232,051 L1-dcache-loads # 37.378 M/sec ( +- 19.25% ) (69.23%)
3,224,574,249 L1-dcache-load-misses # 909.02% of all L1-dcache accesses ( +- 19.25% ) (69.24%)
2,318,195 LLC-loads # 162.804 K/sec ( +- 19.35% ) (69.25%)
206,737 LLC-load-misses # 13.44% of all LL-cache accesses ( +- 18.30% ) (69.25%)
4,950,866 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.26% ) (30.77%)
531,299,560 dTLB-loads # 37.313 M/sec ( +- 19.24% ) (30.76%)
2,811 dTLB-load-misses # 0.00% of all dTLB cache accesses ( +- 17.25% ) (30.75%)
26,355 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.58% ) (30.75%)

7.12187 +- 0.00190 seconds time elapsed ( +- 0.03% )


# Icelakex, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64 --huge 2' (3 runs):

15,764.52 msec task-clock # 2.999 CPUs utilized ( +- 19.25% )
17 context-switches # 1.618 /sec ( +- 20.47% )
3 cpu-migrations # 0.285 /sec ( +- 19.25% )
700 page-faults # 66.614 /sec ( +- 19.22% )
40,560,984,582 cycles # 3.860 GHz ( +- 19.25% ) (38.45%)
79,578,792 instructions # 0.00 insn per cycle ( +- 19.24% ) (46.15%)
13,872,134 branches # 1.320 M/sec ( +- 19.23% ) (53.85%)
119,492 branch-misses # 1.29% of all branches ( +- 18.80% ) (61.55%)
202,854,573,160 slots # 19.304 G/sec ( +- 19.25% ) (69.25%)
3,982,417,725 topdown-retiring # 2.0% Retiring ( +- 19.25% ) (69.25%)
13,523,424,635 topdown-bad-spec # 6.8% Bad Speculation ( +- 18.69% ) (69.25%)
18,661,431 topdown-fe-bound # 0.0% Frontend Bound ( +- 19.28% ) (69.25%)
185,884,147,789 topdown-be-bound # 91.3% Backend Bound ( +- 19.28% ) (69.25%)
22,587,703 L1-dcache-loads # 2.150 M/sec ( +- 19.38% ) (69.25%)
3,223,143,697 L1-dcache-load-misses # 21478.37% of all L1-dcache accesses ( +- 19.25% ) (69.25%)
1,777,675 LLC-loads # 169.169 K/sec ( +- 19.60% ) (69.25%)
126,583 LLC-load-misses # 10.77% of all LL-cache accesses ( +- 19.82% ) (69.25%)
3,333,729 L1-icache-load-misses # 0.00% of all L1-icache accesses ( +- 19.49% ) (30.75%)
19,999,517 dTLB-loads # 1.903 M/sec ( +- 19.38% ) (30.75%)
1,833 dTLB-load-misses # 0.01% of all dTLB cache accesses ( +- 17.72% ) (30.75%)
34,066 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +- 19.09% ) (30.75%)

5.25624 +- 0.00176 seconds time elapsed ( +- 0.03% )


# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=2mb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

5,241.76 msec task-clock # 1.000 CPUs utilized ( +- 0.08% )
10 context-switches # 1.909 /sec ( +- 8.82% )
1 cpu-migrations # 0.191 /sec
65,636 page-faults # 12.530 K/sec ( +- 0.00% )
12,730,694,768 cycles # 2.430 GHz ( +- 0.08% ) (33.31%)
36,709,243 stalled-cycles-frontend # 0.29% frontend cycles idle ( +- 24.07% ) (33.32%)
37,520,225 stalled-cycles-backend # 0.29% backend cycles idle ( +- 9.87% ) (33.34%)
874,896,010 instructions # 0.07 insn per cycle
# 0.05 stalled cycles per insn ( +- 1.23% ) (33.36%)
199,308,386 branches # 38.049 M/sec ( +- 0.84% ) (33.38%)
441,428 branch-misses # 0.22% of all branches ( +- 4.68% ) (33.38%)
2,495,566,569 L1-dcache-loads # 476.417 M/sec ( +- 0.04% ) (33.38%)
1,079,711,798 L1-dcache-load-misses # 43.28% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
50,936,391 L1-icache-loads # 9.724 M/sec ( +- 1.29% ) (33.35%)
284,407 L1-icache-load-misses # 0.56% of all L1-icache accesses ( +- 4.60% ) (33.33%)
546,596 dTLB-loads # 104.348 K/sec ( +- 6.14% ) (33.31%)
231,897 dTLB-load-misses # 42.08% of all dTLB cache accesses ( +- 4.27% ) (33.29%)
6 iTLB-loads # 1.145 /sec ( +- 72.65% ) (33.29%)
34,065 iTLB-load-misses # 262038.46% of all iTLB cache accesses ( +- 44.88% ) (33.29%)
18,237,487 L1-dcache-prefetches # 3.482 M/sec ( +- 12.84% ) (33.29%)

5.23915 +- 0.00421 seconds time elapsed ( +- 0.08% )

# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=2mb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 1' (3 runs):

3,655.71 msec task-clock # 0.999 CPUs utilized ( +- 0.13% )
7 context-switches # 1.913 /sec ( +- 8.25% )
1 cpu-migrations # 0.273 /sec
65,636 page-faults # 17.934 K/sec ( +- 0.00% )
8,879,727,514 cycles # 2.426 GHz ( +- 0.13% ) (33.26%)
5,733,380 stalled-cycles-frontend # 0.06% frontend cycles idle ( +-170.04% ) (33.28%)
42,012,302 stalled-cycles-backend # 0.47% backend cycles idle ( +- 24.51% ) (33.31%)
214,672,610 instructions # 0.02 insn per cycle
# 0.28 stalled cycles per insn ( +- 1.71% ) (33.34%)
42,298,268 branches # 11.557 M/sec ( +- 1.28% ) (33.36%)
267,936 branch-misses # 0.62% of all branches ( +- 7.80% ) (33.37%)
2,235,310,058 L1-dcache-loads # 610.770 M/sec ( +- 0.02% ) (33.37%)
1,089,602,355 L1-dcache-load-misses # 48.73% of all L1-dcache accesses ( +- 0.01% ) (33.37%)
48,725,812 L1-icache-loads # 13.314 M/sec ( +- 0.25% ) (33.37%)
231,227 L1-icache-load-misses # 0.47% of all L1-icache accesses ( +- 13.20% ) (33.37%)
280,655 dTLB-loads # 76.685 K/sec ( +- 13.33% ) (33.37%)
151,028 dTLB-load-misses # 44.02% of all dTLB cache accesses ( +- 6.64% ) (33.35%)
15 iTLB-loads # 4.099 /sec ( +- 6.67% ) (33.32%)
121,208 iTLB-load-misses # 865771.43% of all iTLB cache accesses ( +- 2.74% ) (33.29%)
18,702,209 L1-dcache-prefetches # 5.110 M/sec ( +- 12.51% ) (33.27%)

3.66065 +- 0.00461 seconds time elapsed ( +- 0.13% )


# Milan, baseline (mm/clear_huge_page), region-sz=64g, pg-sz=1gb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):

3,544.20 msec task-clock # 0.996 CPUs utilized ( +- 0.21% )
5 context-switches # 1.406 /sec ( +- 6.67% )
1 cpu-migrations # 0.281 /sec
227 page-faults # 63.819 /sec ( +- 0.15% )
8,609,810,964 cycles # 2.421 GHz ( +- 0.21% ) (33.30%)
77,420,424 stalled-cycles-frontend # 0.90% frontend cycles idle ( +- 20.55% ) (33.33%)
25,197,541 stalled-cycles-backend # 0.29% backend cycles idle ( +- 1.09% ) (33.35%)
658,146,061 instructions # 0.08 insn per cycle
# 0.16 stalled cycles per insn ( +- 0.04% ) (33.38%)
154,867,131 branches # 43.539 M/sec ( +- 0.04% ) (33.41%)
167,531 branch-misses # 0.11% of all branches ( +- 5.19% ) (33.41%)
2,417,846,489 L1-dcache-loads # 679.753 M/sec ( +- 0.01% ) (33.38%)
1,075,531,869 L1-dcache-load-misses # 44.49% of all L1-dcache accesses ( +- 0.01% ) (33.35%)
12,835,321 L1-icache-loads # 3.609 M/sec ( +- 0.41% ) (33.33%)
55,282 L1-icache-load-misses # 0.43% of all L1-icache accesses ( +- 1.98% ) (33.30%)
23,287 dTLB-loads # 6.547 K/sec ( +- 15.61% ) (33.29%)
1,333 dTLB-load-misses # 4.48% of all dTLB cache accesses ( +- 1.26% ) (33.29%)
3 iTLB-loads # 0.843 /sec ( +- 33.33% ) (33.29%)
231 iTLB-load-misses # 11550.00% of all iTLB cache accesses ( +- 6.14% ) (33.29%)
170,608,062 L1-dcache-prefetches # 47.965 M/sec ( +- 0.84% ) (33.29%)

3.55776 +- 0.00738 seconds time elapsed ( +- 0.21% )


# Milan, multi-page (x86/clear_huge_page), region-sz=64g, pg-sz=1gb

Performance counter stats for 'taskset -c 15 bench/pf-test --sz 64g --huge 2' (3 runs):

1,718.27 msec task-clock # 0.999 CPUs utilized ( +- 0.08% )
6 context-switches # 3.489 /sec ( +- 14.70% )
1 cpu-migrations # 0.581 /sec
227 page-faults # 132.000 /sec ( +- 0.15% )
4,176,107,493 cycles # 2.428 GHz ( +- 0.08% ) (33.19%)
2,675,797 stalled-cycles-frontend # 0.06% frontend cycles idle ( +- 0.34% ) (33.25%)
147,394,527 stalled-cycles-backend # 3.53% backend cycles idle ( +- 8.80% ) (33.31%)
12,779,784 instructions # 0.00 insn per cycle
# 13.14 stalled cycles per insn ( +- 0.09% ) (33.37%)
2,428,829 branches # 1.412 M/sec ( +- 0.08% ) (33.42%)
63,460 branch-misses # 2.61% of all branches ( +- 3.48% ) (33.46%)
31,159,378 L1-dcache-loads # 18.119 M/sec ( +- 3.27% ) (33.46%)
14,692,358 L1-dcache-load-misses # 48.21% of all L1-dcache accesses ( +- 3.12% ) (33.46%)
2,556,688 L1-icache-loads # 1.487 M/sec ( +- 0.89% ) (33.46%)
21,148 L1-icache-load-misses # 0.84% of all L1-icache accesses ( +- 0.25% ) (33.41%)
6,114 dTLB-loads # 3.555 K/sec ( +- 12.76% ) (33.35%)
1,742 dTLB-load-misses # 33.73% of all dTLB cache accesses ( +- 21.79% ) (33.29%)
45 iTLB-loads # 26.167 /sec ( +- 7.52% ) (33.23%)
90 iTLB-load-misses # 210.94% of all iTLB cache accesses ( +- 21.20% ) (33.17%)
257,942 L1-dcache-prefetches # 149.993 K/sec ( +- 9.84% ) (33.17%)

1.72042 +- 0.00139 seconds time elapsed ( +- 0.08% )

---
arch/x86/mm/hugetlbpage.c | 49 +++++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..4294b77c4f18 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,55 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 		return hugetlb_get_unmapped_area_topdown(file, addr, len,
 				pgoff, flags);
 }
+
+/*
+ * This is used on all !CONFIG_HIGHMEM configurations.
+ *
+ * CONFIG_HIGHMEM configurations fall back to the __weak version.
+ */
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned long vaddr,
+				unsigned int npages)
+{
+	clear_user_pages(page_address(page), vaddr, page, npages);
+}
+
+void clear_huge_page(struct page *page,
+		     unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+	unsigned long addr = addr_hint &
+		~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+	const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+	const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+	const int width = 2; /* pages cleared last on either side */
+	int sidx[3], eidx[3];
+	int i, n;
+
+	if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+		return clear_contig_region(page, addr, pages_per_huge_page);
+
+	/*
+	 * Neighbourhood of the fault. Cleared at the end to ensure
+	 * it sticks around in the cache.
+	 */
+	n = 2;
+	sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+	eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
+
+	sidx[0] = first_pg;	/* Region to the left of the fault */
+	eidx[0] = sidx[n] - 1;
+
+	sidx[1] = eidx[n] + 1;	/* Region to the right of the fault */
+	eidx[1] = last_pg;
+
+	for (i = 0; i <= 2; i++) {
+		if (eidx[i] >= sidx[i])
+			clear_contig_region(page + sidx[i],
+					    addr + sidx[i] * PAGE_SIZE,
+					    eidx[i] - sidx[i] + 1);
+	}
+}
+#endif /* CONFIG_HIGHMEM */
#endif /* CONFIG_HUGETLB_PAGE */

#ifdef CONFIG_X86_64
--
2.31.1