Re: [PATCH v10 8/8] mm: folio_zero_user: cache neighbouring pages
From: David Hildenbrand (Red Hat)
Date: Tue Dec 23 2025 - 05:11:40 EST
On 12/18/25 22:23, Ankur Arora wrote:
Ankur Arora <ankur.a.arora@xxxxxxxxxx> writes:
David Hildenbrand (Red Hat) <david@xxxxxxxxxx> writes:
On 12/15/25 21:49, Ankur Arora wrote:
folio_zero_user() does straight zeroing without caring about
temporal locality for caches.
This replaced commit c6ddfb6c5890 ("mm, clear_huge_page: move order
algorithm into a separate function") where we cleared a page at a
time converging to the faulting page from the left and the right.
To retain limited temporal locality, split the clearing in three
parts: the faulting page and its immediate neighbourhood, and, the
remaining regions on the left and the right. The local neighbourhood
will be cleared last.
Do this only when zeroing small folios (< MAX_ORDER_NR_PAGES) since
there isn't much expectation of cache locality for large folios.
Performance
===
AMD Genoa (EPYC 9J14, cpus=2 sockets * 96 cores * 2 threads,
memory=2.2 TB, L1d= 16K/thread, L2=512K/thread, L3=2MB/thread)
anon-w-seq (vm-scalability):
stime utime
page-at-a-time 1654.63 ( +- 3.84% ) 811.00 ( +- 3.84% )
contiguous clearing 1602.86 ( +- 3.00% ) 970.75 ( +- 4.68% )
neighbourhood-last 1630.32 ( +- 2.73% ) 886.37 ( +- 5.19% )
Both stime and utime respond in expected ways. stime drops for both
contiguous clearing (-3.14%) and neighbourhood-last (-1.46%)
approaches. However, utime increases for both contiguous clearing
(+19.7%) and neighbourhood-last (+9.28%).
In part this is because anon-w-seq runs with 384 processes zeroing
anonymously mapped memory which they then access sequentially. As
such this is likely an uncommon pattern where the memory bandwidth
is saturated while also being cache limited because we access the
entire region.
Kernel make workload (make -j 12 bzImage):
stime utime
page-at-a-time 138.16 ( +- 0.31% ) 1015.11 ( +- 0.05% )
contiguous clearing 133.42 ( +- 0.90% ) 1013.49 ( +- 0.05% )
neighbourhood-last 131.20 ( +- 0.76% ) 1011.36 ( +- 0.07% )
For make the utime stays relatively flat with an up to 4.9% improvement
in the stime.
Nice evaluation!
Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
Reviewed-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>
---
mm/memory.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 42 insertions(+), 2 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 974c48db6089..d22348b95227 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -7268,13 +7268,53 @@ static void clear_contig_highpages(struct page *page, unsigned long addr,
* @addr_hint: The address accessed by the user or the base address.
*
* Uses architectural support to clear page ranges.
+ *
+ * Clearing of small folios (< MAX_ORDER_NR_PAGES) is split in three parts:
+ * pages in the immediate locality of the faulting page, and its left, right
+ * regions; the local neighbourhood is cleared last in order to keep cache
+ * lines of the faulting region hot.
+ *
+ * For larger folios we assume that there is no expectation of cache locality
+ * and just do a straight zero.
Just wondering: why not do the same thing here as well? Probably shouldn't hurt
and would get rid of some code?
That's a good point. With only a three way split, there's no reason to
treat large folios specially.
A bit more on this: this change makes sense but I'll retain the current
split between patches-7, 8.
Where patch-7, is used to justify using contiguous clearing (and the
choice of value for PROCESS_PAGES_NON_PREEMPT_BATCH), unit based on
preemption model etc and patch-8, for the neighbourhood optimization.
*/
void folio_zero_user(struct folio *folio, unsigned long addr_hint)
{
unsigned long base_addr = ALIGN_DOWN(addr_hint, folio_size(folio));
While at it you could turn that const as well.
Ack.
+ const long fault_idx = (addr_hint - base_addr) / PAGE_SIZE;
+ const struct range pg = DEFINE_RANGE(0, folio_nr_pages(folio) - 1);
+ const int width = 2; /* number of pages cleared last on either side */
Is "width" really the right terminology? (the way you describe it, it's more
like diameter?)
I like diameter. Will make that a define.
I'll make that radius since that's how I'm using it.
All makes sense to me.
--
Cheers
David