Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios

From: Baolin Wang

Date: Mon Mar 09 2026 - 21:38:10 EST




On 3/7/26 4:02 PM, Barry Song wrote:
On Sat, Mar 7, 2026 at 10:22 AM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:



On 3/7/26 5:07 AM, Barry Song wrote:
On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:

Currently, folio_referenced_one() always checks the young flag for each PTE
sequentially, which is inefficient for large folios. This inefficiency is
especially noticeable when reclaiming clean file-backed large folios, where
folio_referenced() is observed as a significant performance hotspot.

Moreover, on the arm64 architecture, which supports contiguous PTEs, there is already
an optimization to clear the young flags for PTEs within a contiguous range.
However, this is not sufficient: we can extend it to perform batched operations
on the entire large folio (which might exceed the contiguous range, CONT_PTE_SIZE).

Introduce a new API, clear_flush_young_ptes(), to facilitate batched checking
of the young flags and flushing of TLB entries, thereby improving performance
during large folio reclamation. It will be overridden by architectures that
implement a more efficient batched operation in the following patches.

While we are at it, rename ptep_clear_flush_young_notify() to
clear_flush_young_ptes_notify() to indicate that this is a batch operation.

Reviewed-by: Harry Yoo <harry.yoo@xxxxxxxxxx>
Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>

LGTM,

Reviewed-by: Barry Song <baohua@xxxxxxxxxx>

Thanks.

---
include/linux/mmu_notifier.h | 9 +++++----
include/linux/pgtable.h | 35 +++++++++++++++++++++++++++++++++++
mm/rmap.c | 28 +++++++++++++++++++++++++---
3 files changed, 65 insertions(+), 7 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index d1094c2d5fb6..07a2bbaf86e9 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -515,16 +515,17 @@ static inline void mmu_notifier_range_init_owner(
range->owner = owner;
}

-#define ptep_clear_flush_young_notify(__vma, __address, __ptep) \
+#define clear_flush_young_ptes_notify(__vma, __address, __ptep, __nr) \
({ \
int __young; \
struct vm_area_struct *___vma = __vma; \
unsigned long ___address = __address; \
- __young = ptep_clear_flush_young(___vma, ___address, __ptep); \
+ unsigned int ___nr = __nr; \
+ __young = clear_flush_young_ptes(___vma, ___address, __ptep, ___nr); \
__young |= mmu_notifier_clear_flush_young(___vma->vm_mm, \
___address, \
___address + \
- PAGE_SIZE); \
+ ___nr * PAGE_SIZE); \
__young; \
})

@@ -650,7 +651,7 @@ static inline void mmu_notifier_subscriptions_destroy(struct mm_struct *mm)

#define mmu_notifier_range_update_to_read_only(r) false

-#define ptep_clear_flush_young_notify ptep_clear_flush_young
+#define clear_flush_young_ptes_notify clear_flush_young_ptes
#define pmdp_clear_flush_young_notify pmdp_clear_flush_young
#define ptep_clear_young_notify ptep_test_and_clear_young
#define pmdp_clear_young_notify pmdp_test_and_clear_young
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 21b67d937555..a50df42a893f 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1068,6 +1068,41 @@ static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
}
#endif

+#ifndef clear_flush_young_ptes
+/**
+ * clear_flush_young_ptes - Mark PTEs that map consecutive pages of the same
+ * folio as old and flush the TLB.
+ * @vma: The virtual memory area the pages are mapped into.
+ * @addr: Address the first page is mapped at.
+ * @ptep: Page table pointer for the first entry.
+ * @nr: Number of entries to clear the access bit for.
+ *
+ * May be overridden by the architecture; otherwise, implemented as a simple
+ * loop over ptep_clear_flush_young().
+ *
+ * Note that PTE bits in the PTE range besides the PFN can differ. For example,
+ * some PTEs might be write-protected.
+ *
+ * Context: The caller holds the page table lock. The PTEs map consecutive
+ * pages that belong to the same folio. The PTEs are all in the same PMD.
+ */
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+ unsigned long addr, pte_t *ptep, unsigned int nr)
+{
+ int young = 0;
+
+ for (;;) {
+ young |= ptep_clear_flush_young(vma, addr, ptep);
+ if (--nr == 0)
+ break;
+ ptep++;
+ addr += PAGE_SIZE;
+ }
+
+ return young;
+}
+#endif

We might have an opportunity to batch the TLB synchronization,
using flush_tlb_range() instead of calling flush_tlb_page()
one by one. Not sure the benefit would be significant though,
especially if only one entry among nr has the young bit set.

Yes. In addition, this will involve many architectures’ implementations
and their differing TLB flush mechanisms, so it’s difficult to make a
reasonable per-architecture measurement. If any architecture has a more
efficient flush method, I’d prefer to implement an architecture‑specific
clear_flush_young_ptes().

Right! Since TLBI is usually quite expensive, I wonder if a generic
implementation for architectures lacking clear_flush_young_ptes()
might benefit from something like the below (just a very rough idea):

int clear_flush_young_ptes(struct vm_area_struct *vma,
			   unsigned long addr, pte_t *ptep, unsigned int nr)
{
	unsigned long curr_addr = addr;
	int young = 0;

	while (nr--) {
		young |= ptep_test_and_clear_young(vma, curr_addr, ptep);
		ptep++;
		curr_addr += PAGE_SIZE;
	}

	if (young)
		flush_tlb_range(vma, addr, curr_addr);
	return young;
}

I understand your point. I'm concerned that I can't test this patch on every
architecture to validate the benefits. Anyway, let me try this on my x86
machine first.