Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()

From: Baolin Wang

Date: Fri Mar 06 2026 - 21:14:50 EST




On 3/7/26 5:20 AM, Barry Song wrote:
On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:

Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
batched checking of young flags and TLB flushing, improving performance during
large folio reclamation.

Performance testing:
Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
33% performance improvement on my Arm64 32-core server (and 10%+ improvement
on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
from approximately 35% to around 5%.

W/o patchset:
real 0m1.518s
user 0m0.000s
sys 0m1.518s

W/ patchset:
real 0m1.018s
user 0m0.000s
sys 0m1.018s

Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>

Reviewed-by: Barry Song <baohua@xxxxxxxxxx>

Thanks, Barry. However, this series has already been merged upstream, so I cannot add your Reviewed-by tag.


---
arch/arm64/include/asm/pgtable.h | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 3dabf5ea17fa..a17eb8a76788 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
 	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
 }

+#define clear_flush_young_ptes clear_flush_young_ptes
+static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
+					 unsigned long addr, pte_t *ptep,
+					 unsigned int nr)
+{
+	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
+		return __ptep_clear_flush_young(vma, addr, ptep);
+
+	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
+}

A similar question arises here:

If nr = 4 for 16KB large folios and one of those entries is young,
we end up flushing the TLB for all 4 PTEs.

If all four entries are young, we win; if only one is young, it seems
we flush three pages redundantly. But arm64 has TLB coalescing, so
maybe they share just one TLB entry?

We discussed a similar issue in the previous thread [1], and I quote some comments from Ryan:

"
My concern was the opportunity cost of evicting the entries for all the
non-accessed parts of the folio from the TLB. But of course, I'm talking
nonsense because the architecture does not allow caching non-accessed entries in the TLB.
"

[1] https://lore.kernel.org/all/02239ca7-9701-4bfa-af0f-dcf0d05a3e89@xxxxxxxxxxxxxxxxx/
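
To make Ryan's point concrete, here is a toy user-space model in C. It is only a sketch and not kernel code: struct toy_pte, toy_touch() and toy_clear_flush_young_ptes() are hypothetical names, and the model assumes that (a) a TLB entry can exist only for a PTE whose access flag has been set, and (b) the batched helper issues one ranged invalidation only when at least one entry in the batch was young.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names are kernel APIs. */
struct toy_pte {
	bool young;      /* models the access flag */
	bool tlb_cached; /* models whether a TLB entry may exist */
};

/* Model a first access: the flag is set, then the entry may be cached. */
static void toy_touch(struct toy_pte *pte)
{
	pte->young = true;
	pte->tlb_cached = true;
}

/*
 * Clear the young flag on nr consecutive entries and invalidate the
 * whole range once if any entry was young. Returns 1 if any entry was
 * young. *evicted counts the TLB entries actually dropped.
 */
static int toy_clear_flush_young_ptes(struct toy_pte *ptes, unsigned int nr,
				      unsigned int *evicted)
{
	unsigned int i;
	int young = 0;

	*evicted = 0;

	for (i = 0; i < nr; i++) {
		if (ptes[i].young) {
			ptes[i].young = false;
			young = 1;
		}
	}

	/* Assumption: one ranged invalidation per batch, only when needed. */
	if (young) {
		for (i = 0; i < nr; i++) {
			if (ptes[i].tlb_cached) {
				ptes[i].tlb_cached = false;
				(*evicted)++;
			}
		}
	}

	return young;
}
```

In this model, if only one of the four entries was touched via toy_touch(), the ranged invalidation evicts exactly one cached entry. The other three were never cached, so under these assumptions covering them in the range costs no useful TLB entries.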