[PATCH v2 6/9] x86/clear_huge_page: multi-page clearing

From: Ankur Arora
Date: Wed Aug 30 2023 - 16:24:00 EST


clear_pages_rep(), clear_pages_erms() clear using string instructions.
While clearing extents of more than a single page, we can use these
more effectively by explicitly advertising the region-size to the
processor.

This can be used as a hint by the processor-uarch to optimize the
clearing (ex. to avoid polluting one or more levels of the data-cache.)

As a secondary benefit, string instructions are typically microcoded,
and so it's a good idea to amortize the cost of the decode across larger
regions.

Accordingly, clear_huge_page() now does huge-page clearing in three
parts: the neighbourhood of the faulting address, the left, and the
right region of the neighbourhood.

The local neighbourhood is cleared last to keep its cachelines hot.

Performance
==

Use mmap(MAP_HUGETLB) to demand fault a 128GB region (on the local
NUMA node):

Milan (EPYC 7J13, boost=1):

mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)

pg-sz=2MB 14.55 19.29 +32.5%
pg-sz=1GB 19.34 49.60 +156.4%

Milan uses a threshold of LLC-size (~32MB) for eliding cacheline
allocation, so we see a dropoff in cacheline-allocations for
pg-sz=1GB:

pg-sz=1GB:
-23,088,001,347 cycles # 3.487 GHz ( +- 0.08% ) (35.68%)
- 4,680,678,939 L1-dcache-loads # 706.831 M/sec ( +- 0.02% ) (35.74%)
- 2,150,395,280 L1-dcache-load-misses # 45.93% of all L1-dcache accesses ( +- 0.01% ) (35.74%)

+ 8,983,798,764 cycles # 3.489 GHz ( +- 0.05% ) (35.59%)
+ 18,294,725 L1-dcache-loads # 7.104 M/sec ( +- 18.88% ) (35.78%)
+ 6,677,565 L1-dcache-load-misses # 30.48% of all L1-dcache accesses ( +- 20.72% ) (35.78%)

That's not the case with pg-sz=2MB, where we perform better but the
number of cacheline allocations remain the same:

pg-sz=2MB:
-31,087,683,852 cycles # 3.494 GHz ( +- 0.17% ) (35.72%)
- 4,898,684,886 L1-dcache-loads # 550.627 M/sec ( +- 0.03% ) (35.71%)
- 2,161,434,236 L1-dcache-load-misses # 44.11% of all L1-dcache accesses ( +- 0.01% ) (35.71%)

+23,368,914,596 cycles # 3.480 GHz ( +- 0.27% ) (35.72%)
+ 4,481,808,430 L1-dcache-loads # 667.382 M/sec ( +- 0.03% ) (35.71%)
+ 2,170,453,309 L1-dcache-load-misses # 48.41% of all L1-dcache accesses ( +- 0.06% ) (35.71%)


Icelakex (Platinum 8358, no_turbo=0):

mm/clear_huge_page x86/clear_huge_page change
(GB/s) (GB/s)

pg-sz=2MB 9.19 12.94 +40.8%
pg-sz=1GB 9.36 12.97 +38.5%

For both page-sizes, Icelakex, behaves similarly to Milan pg-sz=2MB: we
see a drop in cycles but there's no drop in cacheline allocation.

Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
arch/x86/mm/hugetlbpage.c | 54 +++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 5804bbae4f01..0b9f7a6dad93 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -148,6 +148,60 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return hugetlb_get_unmapped_area_topdown(file, addr, len,
pgoff, flags);
}
+
+#ifndef CONFIG_HIGHMEM
+static void clear_contig_region(struct page *page, unsigned int npages)
+{
+ clear_pages(page_address(page), npages);
+}
+
+/*
+ * clear_huge_page(): multi-page clearing variant of clear_huge_page().
+ *
+ * Taking inspiration from the common code variant, we split the zeroing in
+ * three parts: left of the fault, right of the fault, and up to 5 pages
+ * in the immediate neighbourhood of the target page.
+ *
+ * Cleared in that order to keep cache lines of the target region hot.
+ *
+ * For gigantic pages, there is no expectation of cache locality so we do a
+ * straight zeroing.
+ */
+void clear_huge_page(struct page *page,
+ unsigned long addr_hint, unsigned int pages_per_huge_page)
+{
+ unsigned long addr = addr_hint &
+ ~(((unsigned long)pages_per_huge_page << PAGE_SHIFT) - 1);
+ const long pgidx = (addr_hint - addr) / PAGE_SIZE;
+ const int first_pg = 0, last_pg = pages_per_huge_page - 1;
+ const int width = 2; /* pages cleared last on either side */
+ int sidx[3], eidx[3];
+ int i, n;
+
+ if (pages_per_huge_page > MAX_ORDER_NR_PAGES)
+ return clear_contig_region(page, pages_per_huge_page);
+
+ /*
+ * Neighbourhood of the fault. Cleared at the end to ensure
+ * it sticks around in the cache.
+ */
+ n = 2;
+ sidx[n] = (pgidx - width) < first_pg ? first_pg : (pgidx - width);
+ eidx[n] = (pgidx + width) > last_pg ? last_pg : (pgidx + width);
+
+ sidx[0] = first_pg; /* Region to the left of the fault */
+ eidx[0] = sidx[n] - 1;
+
+ sidx[1] = eidx[n] + 1; /* Region to the right of the fault */
+ eidx[1] = last_pg;
+
+ for (i = 0; i <= 2; i++) {
+ if (eidx[i] >= sidx[i])
+ clear_contig_region(page + sidx[i],
+ eidx[i] - sidx[i] + 1);
+ }
+}
+#endif /* CONFIG_HIGHMEM */
#endif /* CONFIG_HUGETLB_PAGE */

#ifdef CONFIG_X86_64
--
2.31.1