Re: [PATCH v1 11/14] arm64/mm: Wire up PTE_CONT for user mappings

From: Catalin Marinas
Date: Mon Jul 03 2023 - 11:17:06 EST


Hi Ryan,

Some comments below. I did not have time to trim down the quoted text,
so you may need to scroll through it.

On Thu, Jun 22, 2023 at 03:42:06PM +0100, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings. Whenever it detects a set of PTEs that meet the
> requirements for a contiguous range, the PTEs are re-painted with the
> PTE_CONT bit.
>
> This initial change provides a baseline that can be optimized in future
> commits. That said, fold/unfold operations (which imply tlb
> invalidation) are avoided where possible with a few tricks for
> access/dirty bit management.
>
> Write-enable and write-protect modifications are likely non-optimal and
> likely incure a regression in fork() performance. This will be addressed
> separately.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@xxxxxxx>
> ---
> arch/arm64/include/asm/pgtable.h | 137 ++++++++++++-
> arch/arm64/mm/Makefile | 3 +-
> arch/arm64/mm/contpte.c | 334 +++++++++++++++++++++++++++++++
> 3 files changed, 466 insertions(+), 8 deletions(-)
> create mode 100644 arch/arm64/mm/contpte.c
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 31df4d73f9ac..17ea534bc5b0 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1115,6 +1115,71 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t old_pte, pte_t new_pte);
>
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty);
> +
> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> + return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> +}
> +
> +static inline bool contpte_is_enabled(struct mm_struct *mm)
> +{
> + /*
> + * Don't attempt to apply the contig bit to kernel mappings, because
> + * dynamically adding/removing the contig bit can cause page faults.
> + * These racing faults are ok for user space, since they get serialized
> + * on the PTL. But kernel mappings can't tolerate faults.
> + */
> +
> + return mm != &init_mm;
> +}
> +
> +static inline void contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * Only bother trying if both the virtual and physical addresses are
> + * aligned and correspond to the last entry in a contig range. The core
> + * code mostly modifies ranges from low to high, so this is the likely
> + * the last modification in the contig range, so a good time to fold.
> + */
> +
> + bool valign = ((unsigned long)ptep >> 3) % CONT_PTES == CONT_PTES - 1;
> + bool palign = pte_pfn(pte) % CONT_PTES == CONT_PTES - 1;
> +
> + if (contpte_is_enabled(mm) &&
> + pte_present(pte) && !pte_cont(pte) &&
> + valign && palign)
> + __contpte_try_fold(mm, addr, ptep, pte);

I would use pte_valid() here instead. pte_present() also includes the
PTE_PROT_NONE option which we don't really care about as it's not
accessible.

I've been discussing with Alexandru Elisei about PTE_PROT_NONE and
whether we can use other bits from the pte even if they clash with other
valid permissions. Since the pte is not valid, in theory we could as
long as nothing corrupts the (like a cont bit). The background to this
is multiple migrate types (not just NUMA) for the MTE tag carveout
reuse.

> +}
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + if (contpte_is_enabled(mm) &&
> + pte_present(pte) && pte_cont(pte))
> + __contpte_try_unfold(mm, addr, ptep, pte);
> +}

Same here and probably most other places where pte_present() is used in
this patch.

> +
> /*
> * The below functions constitute the public API that arm64 presents to the
> * core-mm to manipulate PTE entries within the their page tables (or at least
> @@ -1122,30 +1187,68 @@ extern void ptep_modify_prot_commit(struct vm_area_struct *vma,
> * versions will automatically and transparently apply the contiguous bit where
> * it makes sense to do so. Therefore any users that are contig-aware (e.g.
> * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> - * private versions, which are prefixed with double underscore.
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
> */
>
> #define ptep_get ptep_get
> static inline pte_t ptep_get(pte_t *ptep)
> {
> - return __ptep_get(ptep);
> + pte_t pte = __ptep_get(ptep);
> +
> + if (!pte_present(pte) || !pte_cont(pte))
> + return pte;
> +
> + return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> + pte_t pte = __ptep_get(ptep);
> +
> + if (!pte_present(pte) || !pte_cont(pte))
> + return pte;
> +
> + return contpte_ptep_get_lockless(ptep);
> }
>
> static inline void set_pte(pte_t *ptep, pte_t pte)
> {
> - __set_pte(ptep, pte);
> + /*
> + * We don't have the mm or vaddr so cannot unfold or fold contig entries
> + * (since it requires tlb maintenance). set_pte() is not used in core
> + * code, so this should never even be called. Regardless do our best to
> + * service any call and emit a warning if there is any attempt to set a
> + * pte on top of an existing contig range.
> + */
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + WARN_ON_ONCE(pte_present(orig_pte) && pte_cont(orig_pte));
> + __set_pte(ptep, pte_mknoncont(pte));

Why the pte_mknoncont() here? Do we expect a contiguous pte? The warning
only checks the old entry.

> }
>
> #define set_ptes set_ptes
> static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep, pte_t pte, unsigned int nr)
> {
> - __set_ptes(mm, addr, ptep, pte, nr);
> + pte = pte_mknoncont(pte);
> +
> + if (!contpte_is_enabled(mm))
> + __set_ptes(mm, addr, ptep, pte, nr);
> + else if (nr == 1) {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> + __set_ptes(mm, addr, ptep, pte, nr);
> + contpte_try_fold(mm, addr, ptep, pte);
> + } else
> + contpte_set_ptes(mm, addr, ptep, pte, nr);
> }
>
> static inline void pte_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> __pte_clear(mm, addr, ptep);
> }
>
> @@ -1153,6 +1256,7 @@ static inline void pte_clear(struct mm_struct *mm,
> static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> return __ptep_get_and_clear(mm, addr, ptep);
> }
>
> @@ -1160,21 +1264,33 @@ static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {
> - return __ptep_test_and_clear_young(vma, addr, ptep);
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> + return __ptep_test_and_clear_young(vma, addr, ptep);

Since I've seen this construct a few times, you may want to turn it into
a specific check: pte_valid_cont().

> +
> + return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> }
>
> #define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep)
> {
> - return __ptep_clear_flush_young(vma, addr, ptep);
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> + return __ptep_clear_flush_young(vma, addr, ptep);
> +
> + return contpte_ptep_clear_flush_young(vma, addr, ptep);
> }
>
> #define __HAVE_ARCH_PTEP_SET_WRPROTECT
> static inline void ptep_set_wrprotect(struct mm_struct *mm,
> unsigned long addr, pte_t *ptep)
> {
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> __ptep_set_wrprotect(mm, addr, ptep);
> + contpte_try_fold(mm, addr, ptep, __ptep_get(ptep));
> }
>
> #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> @@ -1182,7 +1298,14 @@ static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> unsigned long addr, pte_t *ptep,
> pte_t entry, int dirty)
> {
> - return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> + pte_t orig_pte = __ptep_get(ptep);
> +
> + entry = pte_mknoncont(entry);

As in a few other places, it's not clear to me why the pte_mknoncont()
is needed. Here I expect 'entry' to be cont if *ptep is cont.

> +
> + if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> + return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);

Also wondering, can we have this check on 'entry' rather than
'orig_pte'? And maybe a warning if the cont bit differs between them.

> +
> + return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> }
>
> #endif /* !__ASSEMBLY__ */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..70b6aba09b5d 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -2,7 +2,8 @@
> obj-y := dma-mapping.o extable.o fault.o init.o \
> cache.o copypage.o flush.o \
> ioremap.o mmap.o pgd.o mmu.o \
> - context.o proc.o pageattr.o fixmap.o
> + context.o proc.o pageattr.o fixmap.o \
> + contpte.o
> obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
> obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
> obj-$(CONFIG_PTDUMP_DEBUGFS) += ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..e8e4a298fd53
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,334 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <asm/tlbflush.h>
> +
> +static void ptep_clear_flush_range(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, int nr)
> +{
> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long start_addr = addr;
> + int i;
> +
> + for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE)
> + __pte_clear(mm, addr, ptep);
> +
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +}
> +
> +static bool ptep_any_present(pte_t *ptep, int nr)

Valid?

> +{
> + int i;
> +
> + for (i = 0; i < nr; i++, ptep++) {
> + if (pte_present(__ptep_get(ptep)))
> + return true;
> + }
> +
> + return false;
> +}
> +
> +static void contpte_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, bool fold)
> +{
> + struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> + unsigned long start_addr;
> + pte_t *start_ptep;
> + int i;
> +
> + start_ptep = ptep = contpte_align_down(ptep);
> + start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> + pte = fold ? pte_mkcont(pte) : pte_mknoncont(pte);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> + pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> + if (pte_dirty(ptent))
> + pte = pte_mkdirty(pte);
> +
> + if (pte_young(ptent))
> + pte = pte_mkyoung(pte);
> + }

I presume this can be unsafe if any of the ptes in the range differ, so
we need some higher level check. But that means we now have three loops
for folding, one to check, another to clear and the last one via
__set_ptes(). I guess we can't collapse the first two loops in a 'try'
function as we need to do the cleaning (and would have to re-instate the
old entries if they can't be made contiguous).

> +
> + __flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> + __set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_fold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the virtual and pysical addresses are
> + * correctly aligned for a contig mapping in contpte_try_fold() so the
> + * remaining checks are to ensure that the contig range is fully covered
> + * by a single folio, and ensure that all the ptes are present with
> + * contiguous PFNs and matching prots.
> + */
> +
> + struct page *page = pte_page(pte);
> + struct folio *folio = page_folio(page);
> + unsigned long folio_saddr = addr - (page - &folio->page) * PAGE_SIZE;
> + unsigned long folio_eaddr = folio_saddr + folio_nr_pages(folio) * PAGE_SIZE;
> + unsigned long cont_saddr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> + unsigned long cont_eaddr = cont_saddr + CONT_PTE_SIZE;
> + unsigned long pfn;
> + pgprot_t prot;
> + pte_t subpte;
> + pte_t *orig_ptep;
> + int i;
> +
> + if (folio_saddr > cont_saddr || folio_eaddr < cont_eaddr)
> + return;
> +
> + pfn = pte_pfn(pte) - ((addr - cont_saddr) >> PAGE_SHIFT);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> + orig_ptep = ptep;
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + subpte = __ptep_get(ptep);
> + subpte = pte_mkold(pte_mkclean(subpte));

IIUC, this function assumes ptes that only differ by the dirty status
can be contiguous. That's probably ok, with a chance of the dirty status
spreading to the adjacent ptes in the fold function. Maybe add a comment
on why this is ok (or why it doesn't happen).

> +
> + if (!pte_present(subpte) ||
> + pte_pfn(subpte) != pfn ||
> + pgprot_val(pte_pgprot(subpte)) != pgprot_val(prot))
> + return;
> + }
> +
> + contpte_fold(mm, addr, orig_ptep, pte, true);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte)
> +{
> + /*
> + * We have already checked that the ptes are contiguous in
> + * contpte_try_unfold(), so we can unfold unconditionally here.
> + */
> +
> + contpte_fold(mm, addr, ptep, pte, false);
> +}

So the pte_mkyoung(), pte_mkdirty() calls in contpte_fold() are mostly
for the unfold case. Maybe it's clearer if we just have two separate
functions (or document why the pte_mk*() are needed).

> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We are guarranteed to be holding the PTL, so any
> + * contiguous range cannot be unfolded or otherwise modified under our
> + * feet.
> + */
> +
> + pte_t pte;
> + int i;
> +
> + ptep = contpte_align_down(ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++) {
> + pte = __ptep_get(ptep);
> +
> + /*
> + * Deal with the partial contpte_ptep_get_and_clear_full() case,
> + * where some of the ptes in the range may be cleared but others
> + * are still to do. See contpte_ptep_get_and_clear_full().
> + */
> + if (pte_val(pte) == 0)
> + continue;
> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> + /*
> + * Gather access/dirty bits, which may be populated in any of the ptes
> + * of the contig range. We may not be holding the PTL, so any contiguous
> + * range may be unfolded/modified/refolded under our feet.
> + */
> +
> + pte_t orig_pte;
> + pgprot_t orig_prot;
> + pte_t *ptep;
> + unsigned long pfn;
> + pte_t pte;
> + pgprot_t prot;
> + int i;
> +
> +retry:
> + orig_pte = __ptep_get(orig_ptep);
> +
> + if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> + return orig_pte;

I haven't looked through all the patches, so not entirely sure when this
function is called. But since you mention that the range may be
folded/unfolded, how do we deal with pte_cont() racing with something
setting the contig bit?

> +
> + orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> + ptep = contpte_align_down(orig_ptep);
> + pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> + pte = __ptep_get(ptep);
> + prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> + if (!pte_present(pte) || !pte_cont(pte) ||
> + pte_pfn(pte) != pfn ||
> + pgprot_val(prot) != pgprot_val(orig_prot))
> + goto retry;

It needs better documenting, I don't understand what the retry here is
for (presumably to handle some races). Do we care about some memory
ordering as well? __pte_get() only takes care of reading the ptep once.

> +
> + if (pte_dirty(pte))
> + orig_pte = pte_mkdirty(orig_pte);
> +
> + if (pte_young(pte))
> + orig_pte = pte_mkyoung(orig_pte);
> + }
> +
> + return orig_pte;
> +}
> +
> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> + pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> + unsigned long next;
> + unsigned long end = addr + (nr << PAGE_SHIFT);
> + unsigned long pfn = pte_pfn(pte);
> + pgprot_t prot = pte_pgprot(pte);
> + pte_t orig_pte;
> +
> + do {
> + next = pte_cont_addr_end(addr, end);
> + nr = (next - addr) >> PAGE_SHIFT;
> + pte = pfn_pte(pfn, prot);
> +
> + if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> + pte = pte_mkcont(pte);
> + else
> + pte = pte_mknoncont(pte);
> +
> + /*
> + * If operating on a partial contiguous range then we must first
> + * unfold the contiguous range if it was previously folded.
> + * Otherwise we could end up with overlapping tlb entries.
> + */
> + if (nr != CONT_PTES)
> + contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +
> + /*
> + * If we are replacing ptes that were contiguous or if the new
> + * ptes are contiguous and any of the ptes being replaced are
> + * present, we need to clear and flush the range to prevent
> + * overlapping tlb entries.
> + */
> + orig_pte = __ptep_get(ptep);
> + if ((pte_present(orig_pte) && pte_cont(orig_pte)) ||
> + (pte_cont(pte) && ptep_any_present(ptep, nr)))
> + ptep_clear_flush_range(mm, addr, ptep, nr);
> +
> + __set_ptes(mm, addr, ptep, pte, nr);
> +
> + addr = next;
> + ptep += nr;
> + pfn += nr;
> +
> + } while (addr != end);
> +}
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + /*
> + * ptep_clear_flush_young() technically requires us to clear the access
> + * flag for a _single_ pte. However, the core-mm code actually tracks
> + * access/dirty per folio, not per page. And since we only create a
> + * contig range when the range is covered by a single folio, we can get
> + * away with clearing young for the whole contig range here, so we avoid
> + * having to unfold.
> + */
> +
> + int i;
> + int young = 0;
> +
> + ptep = contpte_align_down(ptep);
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> + return young;
> +}
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep)
> +{
> + int young;
> +
> + young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + if (young) {
> + /*
> + * See comment in __ptep_clear_flush_young(); same rationale for
> + * eliding the trailing DSB applies here.
> + */
> + __flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> + PAGE_SIZE, true, 3);
> + }
> +
> + return young;
> +}
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> + unsigned long addr, pte_t *ptep,
> + pte_t entry, int dirty)
> +{
> + pte_t orig_pte;
> + int i;
> +
> + /*
> + * Gather the access/dirty bits for the contiguous range. If nothing has
> + * changed, its a noop.
> + */
> + orig_pte = ptep_get(ptep);
> + if (pte_val(orig_pte) == pte_val(entry))
> + return 0;
> +
> + /*
> + * We can fix up access/dirty bits without having to unfold/fold the
> + * contig range. But if the write bit is changing, we need to go through
> + * the full unfold/fold cycle.
> + */
> + if (pte_write(orig_pte) == pte_write(entry)) {

Depending on the architecture version, pte_write() either checks a
software only bit or it checks the DBM one.

> + /*
> + * No need to flush here; This is always "more permissive" so we
> + * can only be _adding_ the access or dirty bit. And since the
> + * tlb can't cache an entry without the AF set and the dirty bit
> + * is a SW bit, there can be no confusion. For HW access
> + * management, we technically only need to update the flag on a
> + * single pte in the range. But for SW access management, we
> + * need to update all the ptes to prevent extra faults.
> + */

On pre-DBM hardware, a PTE_RDONLY entry (writable from the kernel
perspective but clean) may be cached in the TLB and we do need flushing.

> + ptep = contpte_align_down(ptep);
> + addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> + for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> + __ptep_set_access_flags(vma, addr, ptep, entry, 0);
> + } else {
> + /*
> + * No need to flush in __ptep_set_access_flags() because we just
> + * flushed the whole range in __contpte_try_unfold().
> + */
> + __contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> + __ptep_set_access_flags(vma, addr, ptep, entry, 0);
> + contpte_try_fold(vma->vm_mm, addr, ptep, entry);
> + }
> +
> + return 1;
> +}

--
Catalin