Re: [PATCH v5 1/5] mm: rmap: support batched checks of the references for large folios
From: Baolin Wang
Date: Mon Feb 09 2026 - 04:26:28 EST
On 2/9/26 5:20 PM, David Hildenbrand (Arm) wrote:
On 2/9/26 10:14, Baolin Wang wrote:
On 2/9/26 4:49 PM, David Hildenbrand (Arm) wrote:
On 12/26/25 07:07, Baolin Wang wrote:
Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.
Moreover, on the Arm64 architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient. We can extend this to perform batched operations
for the entire large folio (which might exceed the contiguous range: CONT_PTE_SIZE).
Introduce a new API: clear_flush_young_ptes() to facilitate batched checking
of the young flags and flushing TLB entries, thereby improving performance
during large folio reclamation. Architectures that implement a more efficient
batch operation will override it in the following patches.
While we are at it, rename ptep_clear_flush_young_notify() to
clear_flush_young_ptes_notify() to indicate that this is a batch operation.
Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
---
include/linux/mmu_notifier.h | 9 +++++----
include/linux/pgtable.h | 31 +++++++++++++++++++++++++++++++
mm/rmap.c | 31 ++++++++++++++++++++++++++++---
3 files changed, 64 insertions(+), 7 deletions(-)
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..07a2bbaf86e9 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
range->owner = owner;
}
-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
({ \
int __young; \
struct vm_area_struct *___vma = __vma; \
unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ unsigned int ___nr = __nr; \
+ __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
__young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
___address, \
___address + \
- PAGE_SIZE); \
+ ___nr * PAGE_SIZE); \
__young; \
})
Man, that's ugly. Not your fault, but could this possibly be turned into an inline function in a follow-up patch?
Yes, the cleanup of these macros is already in my follow-up patch set.
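For illustration only, here is a rough userspace sketch of what an inline-function form of the macro above might look like. It is not the follow-up patch. The types, PAGE_SIZE value, and both callees are stand-in stubs written for this example (the stub treats bit 0 of the PTE as the young bit, which is an assumption), and the assert() is a sketch-only sanity check that kernel code would not carry. The one detail taken from the patch is the notifier range, which ends at addr + nr * PAGE_SIZE.

```c
#include <assert.h>

/* Userspace stand-ins (assumptions); the kernel definitions differ. */
#define PAGE_SIZE 4096UL
typedef struct { unsigned long val; } pte_t;
struct mm_struct { int unused; };
struct vm_area_struct { struct mm_struct *vm_mm; };

/* Records the last range handed to the stubbed notifier, for inspection. */
static unsigned long last_start, last_end;

/* Stub: treat bit 0 of each PTE as the young bit; clear it and report it. */
static int clear_flush_young_ptes(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)
{
	int young = 0;

	(void)vma;
	(void)addr;
	for (unsigned int i = 0; i < nr; i++) {
		young |= (int)(ptep[i].val & 1UL);
		ptep[i].val &= ~1UL;
	}
	return young;
}

/* Stub: a real notifier would query secondary MMUs; this only records the range. */
static int mmu_notifier_clear_flush_young(struct mm_struct *mm,
		unsigned long start, unsigned long end)
{
	(void)mm;
	last_start = start;
	last_end = end;
	return 0;
}

/* Hypothetical inline-function form of clear_flush_young_ptes_notify(). */
static inline int clear_flush_young_ptes_notify(struct vm_area_struct *vma,
		unsigned long addr, pte_t *ptep, unsigned int nr)
{
	int young;

	assert(nr > 0); /* sketch-only sanity check */
	young = clear_flush_young_ptes(vma, addr, ptep, nr);
	young |= mmu_notifier_clear_flush_young(vma->vm_mm, addr,
						addr + nr * PAGE_SIZE);
	return young;
}
```

A function evaluates each argument exactly once, so the ___vma, ___address and ___nr temporaries from the macro are no longer needed.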
+#ifndef clear_flush_young_ptes
+/**
+ * clear_flush_young_ptes - Clear the access bit and perform a TLB flush for PTEs
+ * that map consecutive pages of the same folio.
With the clear_young_dirty_ptes() description in mind, this should probably be "Mark PTEs that map consecutive pages of the same folio as clean and flush the TLB"?
IMO, “clean” is confusing here, as it sounds like clearing the dirty bit to make the folio clean.
"as old", sorry, I used the wrong part of the description.
OK.
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear access bit.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_clear_flush_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep,
+ unsigned int nr)
Use two-tab alignment on the second and following lines, like all similar functions here.
Sure.
+{
+ int i, young = 0;
+
+ for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
+ young |= ptep_clear_flush_young(vma, addr, ptep);
+
Why don't we use a loop similar to the one in clear_young_dirty_ptes() or clear_full_ptes() etc.? It's not only consistent but also optimizes out the first check for nr.
for (;;) {
young |= ptep_clear_flush_young(vma, addr, ptep);
if (--nr == 0)
break;
ptep++;
addr += PAGE_SIZE;
}
We’ve discussed this loop pattern before [1], and it seems that people prefer the ‘for (;;)’ loop. Do you have a strong preference for changing it back?
Yes, to make all such helpers look consistent. Note that your version was also not consistent with the other variants.
Ryan's point was about avoiding two ptep_clear_flush_young() calls, which the for (;;) loop avoids as well.
Actually, my v2 [1] followed the previous pattern. Anyway, let me change it back.
[1] https://lore.kernel.org/all/545dba5e899634bc6c8ca782417d16fef3bd049f.1765439381.git.baolin.wang@xxxxxxxxxxxxxxxxx/
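To make the comparison between the two loop shapes in this thread concrete, here is a small userspace sketch. The stub ptep_clear_flush_young_stub(), the call counter, the pte_t layout (bit 0 as the young bit), and the PAGE_SIZE value are all assumptions made for this example, not kernel code. Both variants make exactly one stub call per entry. The for (;;) form drops the initial nr check, so it relies on the caller passing nr >= 1, which the sketch asserts.

```c
#include <assert.h>

/* Userspace stand-ins (assumptions); the kernel definitions differ. */
#define PAGE_SIZE 4096UL
typedef struct { unsigned long val; } pte_t;

/* Counts how often the per-PTE helper runs. */
static unsigned int nr_calls;

/* Stub: treat bit 0 as the young bit; clear it and return its old value. */
static int ptep_clear_flush_young_stub(unsigned long addr, pte_t *ptep)
{
	int young = (int)(ptep->val & 1UL);

	(void)addr;
	ptep->val &= ~1UL;
	nr_calls++;
	return young;
}

/* Counted loop, as posted in v5: checks nr before the first call. */
static int clear_young_counted(unsigned long addr, pte_t *ptep,
		unsigned int nr)
{
	unsigned int i;
	int young = 0;

	for (i = 0; i < nr; ++i, ++ptep, addr += PAGE_SIZE)
		young |= ptep_clear_flush_young_stub(addr, ptep);
	return young;
}

/* for (;;) loop, as suggested in review: no check before the first call. */
static int clear_young_forever(unsigned long addr, pte_t *ptep,
		unsigned int nr)
{
	int young = 0;

	assert(nr > 0); /* nr == 0 would wrap around in --nr below */
	for (;;) {
		young |= ptep_clear_flush_young_stub(addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}
	return young;
}
```

For any nr >= 1 the two functions return the same result and touch the same entries, so the choice comes down to consistency with the existing helpers and the skipped first comparison.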