Re: [PATCH v6 1/5] mm: rmap: support batched checks of the references for large folios
From: Barry Song
Date: Thu Mar 26 2026 - 01:33:12 EST
On Thu, Mar 26, 2026 at 9:47 AM Baolin Wang
<baolin.wang@xxxxxxxxxxxxxxxxx> wrote:
>
> On 3/25/26 11:06 PM, Lorenzo Stoakes (Oracle) wrote:
> > On Wed, Mar 25, 2026 at 03:58:36PM +0100, David Hildenbrand (Arm) wrote:
> >> On 3/25/26 15:36, Lorenzo Stoakes (Oracle) wrote:
> >>> On Mon, Mar 16, 2026 at 03:15:18PM +0100, David Hildenbrand (Arm) wrote:
> >>>> On 3/16/26 07:25, Baolin Wang wrote:
> >>>>>
> >>>>> Sure. However, after investigating RISC-V and x86, I found that
> >>>>> ptep_clear_flush_young() does not flush the TLB on these architectures:
> >>>>>
> >>>>> int ptep_clear_flush_young(struct vm_area_struct *vma,
> >>>>>                            unsigned long address, pte_t *ptep)
> >>>>> {
> >>>>>         /*
> >>>>>          * On x86 CPUs, clearing the accessed bit without a TLB flush
> >>>>>          * doesn't cause data corruption. [ It could cause incorrect
> >>>>>          * page aging and the (mistaken) reclaim of hot pages, but the
> >>>>>          * chance of that should be relatively low. ]
> >>>>>          *
> >>>>>          * So as a performance optimization don't flush the TLB when
> >>>>>          * clearing the accessed bit, it will eventually be flushed by
> >>>>>          * a context switch or a VM operation anyway. [ In the rare
> >>>>>          * event of it not getting flushed for a long time the delay
> >>>>>          * shouldn't really matter because there's no real memory
> >>>>>          * pressure for swapout to react to. ]
> >>>>>          */
> >>>>>         return ptep_test_and_clear_young(vma, address, ptep);
> >>>>> }
> >>>>
> >>>> You'd probably want an arch helper then, that tells you whether
> >>>> a flush_tlb_range() after ptep_test_and_clear_young() is required.
> >>>>
> >>>> Or some special flush_tlb_range() helper.
> >>>>
> >>>> I agree that it requires more work.
>
> (Sorry, David. I forgot to reply to your email because I've had a lot to
> sort out recently.)
>
> Rather than adding more arch helpers (we already have plenty for the
> young flag check), I think we should try removing the TLB flush, as I
> mentioned to Barry[1]. MGLRU reclaim already skips the TLB flush, and it
> seems to work fine. What do you think?
>
> Here are our previous attempts to remove the TLB flush:
>
> My patch: https://lkml.org/lkml/2023/10/24/533
> Barry's patch:
> https://lore.kernel.org/lkml/20220617070555.344368-1-21cnbao@xxxxxxxxx/
>
> [1]
> https://lore.kernel.org/all/6bdc4b03-9631-4717-a3fa-2785a7930aba@xxxxxxxxxxxxxxxxx/
Here is what ptep_clear_flush_young() does on each architecture:

x86:
  Does not perform any TLB invalidation; it simply calls
  ptep_test_and_clear_young().
RISC-V:
  Exactly the same behavior as x86.
s390:
  Simply calls ptep_test_and_clear_young().
powerpc:
  Simply calls ptep_test_and_clear_young().
parisc:
  set_pte() + __flush_cache_page(), whereas
  ptep_test_and_clear_young() doesn't need __flush_cache_page().
arm64:
  ptep_test_and_clear_young() followed by flush_tlb_page_nosync(),
  which can still be expensive, based on my previous observations.
others:
  ptep_test_and_clear_young() + flush_tlb_page().
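
The behaviors listed above fall into roughly three patterns: clear only, clear plus a non-synchronizing invalidate, and clear plus a full per-page flush. Below is a rough userspace sketch of that split, not kernel code. The enum, the struct, and every model_* name are made up for illustration, and the "flush only if the bit was set" rule is an assumption of this model. An arch helper of the kind David suggested would presumably have to report something like the policy value used here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum young_flush_policy {
	YOUNG_FLUSH_NONE,	/* clear only (x86, RISC-V, s390, powerpc in the list) */
	YOUNG_FLUSH_NOSYNC,	/* clear + invalidate without waiting (arm64 in the list) */
	YOUNG_FLUSH_PAGE,	/* clear + full per-page flush ("others" in the list) */
};

#define MODEL_PTE_YOUNG	(1UL << 0)	/* stand-in for the accessed bit */

/* Counts how many flushes of each kind the model would have issued. */
struct model_stats {
	unsigned int nosync_flushes;
	unsigned int page_flushes;
};

/* Models ptep_test_and_clear_young(): return the old young state, clear it. */
static bool model_test_and_clear_young(unsigned long *pte)
{
	bool young;

	assert(pte != NULL);
	young = (*pte & MODEL_PTE_YOUNG) != 0;
	*pte &= ~MODEL_PTE_YOUNG;
	return young;
}

/* Models ptep_clear_flush_young() under the given policy. */
static bool model_clear_flush_young(enum young_flush_policy policy,
				    unsigned long *pte,
				    struct model_stats *stats)
{
	bool young = model_test_and_clear_young(pte);

	/* Assumption of this model: nothing to invalidate if the bit was clear. */
	if (!young)
		return false;

	switch (policy) {
	case YOUNG_FLUSH_NONE:
		break;
	case YOUNG_FLUSH_NOSYNC:
		stats->nosync_flushes++;
		break;
	case YOUNG_FLUSH_PAGE:
		stats->page_flushes++;
		break;
	}
	return true;
}
```

With YOUNG_FLUSH_NONE the function degenerates to the plain test-and-clear, which is the case this thread is arguing should become the common one.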

Revisiting the comment for x86:

        /*
         * On x86 CPUs, clearing the accessed bit without a TLB flush
         * doesn't cause data corruption. [ It could cause incorrect
         * page aging and the (mistaken) reclaim of hot pages, but the
         * chance of that should be relatively low. ]
         *
         * So as a performance optimization don't flush the TLB when
         * clearing the accessed bit, it will eventually be flushed by
         * a context switch or a VM operation anyway. [ In the rare
         * event of it not getting flushed for a long time the delay
         * shouldn't really matter because there's no real memory
         * pressure for swapout to react to. ]
         */

I feel that at least this reasoning also applies to arm64.

Maybe Ryan, Will, or Catalin can clarify why arm64 requires a
nosync TLBI here, whereas x86 requires no invalidation at all?
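
To make the trade-off in the x86 comment concrete, here is a toy userspace model of a single page. It is a deliberate simplification, not a description of any real MMU: it assumes the accessed bit is written only on a page table walk (a TLB miss), and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model; all names are hypothetical, not kernel or hardware APIs. */
struct model_page {
	bool pte_young;		/* accessed bit in the page table entry */
	bool tlb_valid;		/* translation currently cached in the TLB */
};

/*
 * CPU touches the page. Assumption of this model: the accessed bit is
 * only written during a page table walk, i.e. on a TLB miss.
 */
static void model_cpu_access(struct model_page *p)
{
	assert(p != NULL);
	if (p->tlb_valid)
		return;		/* TLB hit: no walk, PTE left untouched */
	p->pte_young = true;
	p->tlb_valid = true;
}

/* Reclaim scan: test and clear the young bit, optionally dropping the TLB entry. */
static bool model_scan(struct model_page *p, bool flush)
{
	bool young;

	assert(p != NULL);
	young = p->pte_young;
	p->pte_young = false;
	if (young && flush)
		p->tlb_valid = false;
	return young;
}

/* Stand-in for "a context switch or a VM operation" flushing the TLB later. */
static void model_unrelated_flush(struct model_page *p)
{
	assert(p != NULL);
	p->tlb_valid = false;
}
```

In this model, skipping the flush means a hot page can look cold on the next scan until some unrelated flush happens, but the page contents are never affected. That is the "incorrect page aging, no data corruption" case the comment describes.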
Thanks
Barry