Re: [PATCH v6 4/5] arm64: mm: implement the architecture-specific clear_flush_young_ptes()
From: Barry Song
Date: Fri Mar 06 2026 - 16:20:54 EST
On Mon, Feb 9, 2026 at 10:07 PM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
>
> Implement the Arm64 architecture-specific clear_flush_young_ptes() to enable
> batched checking of young flags and TLB flushing, improving performance during
> large folio reclamation.
>
> Performance testing:
> Allocate 10G clean file-backed folios by mmap() in a memory cgroup, and try to
> reclaim 8G file-backed folios via the memory.reclaim interface. I can observe
> 33% performance improvement on my Arm64 32-core server (and 10%+ improvement
> on my X86 machine). Meanwhile, the hotspot folio_check_references() dropped
> from approximately 35% to around 5%.
>
> W/o patchset:
> real 0m1.518s
> user 0m0.000s
> sys 0m1.518s
>
> W/ patchset:
> real 0m1.018s
> user 0m0.000s
> sys 0m1.018s
>
> Reviewed-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> Signed-off-by: Baolin Wang <baolin.wang@xxxxxxxxxxxxxxxxx>
Reviewed-by: Barry Song <baohua@xxxxxxxxxx>
> ---
> arch/arm64/include/asm/pgtable.h | 11 +++++++++++
> 1 file changed, 11 insertions(+)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 3dabf5ea17fa..a17eb8a76788 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1838,6 +1838,17 @@ static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> 	return contpte_clear_flush_young_ptes(vma, addr, ptep, 1);
> }
>
> +#define clear_flush_young_ptes clear_flush_young_ptes
> +static inline int clear_flush_young_ptes(struct vm_area_struct *vma,
> +					 unsigned long addr, pte_t *ptep,
> +					 unsigned int nr)
> +{
> +	if (likely(nr == 1 && !pte_cont(__ptep_get(ptep))))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_clear_flush_young_ptes(vma, addr, ptep, nr);
> +}
A similar question arises here:
If nr = 4 for a 16KB large folio and only one of those entries is young,
we end up flushing the TLB for all 4 PTEs.
If all four entries are young, we win; if only one is young, it seems
we flush 3 pages redundantly. But arm64 has TLB coalescing, so
maybe all four pages share a single TLB entry anyway?
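To make the arithmetic concrete, here is a rough user-space sketch, not
kernel code. The names count_young() and redundant_flushes() and the
young[] array are hypothetical, and it assumes the batched path flushes
the whole range as soon as any entry in it is young:

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model: young[i] stands in for the access flag of the
 * i-th PTE in a batch of nr entries.
 */
static unsigned int count_young(const bool *young, unsigned int nr)
{
	unsigned int i, n = 0;

	assert(young || nr == 0);

	for (i = 0; i < nr; i++)
		if (young[i])
			n++;

	return n;
}

/*
 * Number of pages covered by the flush that were not young.
 * Assumption: one young entry triggers a flush of all nr pages,
 * and no young entry means no flush at all.
 */
static unsigned int redundant_flushes(const bool *young, unsigned int nr)
{
	unsigned int n = count_young(young, nr);

	return n ? nr - n : 0;
}
```

With nr = 4 and a single young entry this model reports 3 redundant
pages; with all four young it reports 0. It says nothing about whether
the hardware keeps those pages in one coalesced TLB entry, which is the
open part of the question.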
> +
> #define wrprotect_ptes wrprotect_ptes
> static __always_inline void wrprotect_ptes(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep, unsigned int nr)
> --
> 2.47.3
Thanks
Barry