Re: [PATCH v2 09/10] mm/hugetlb: Introduce hugetlb_walk()

From: Peter Xu
Date: Thu Dec 08 2022 - 16:02:55 EST


On Wed, Dec 07, 2022 at 04:12:31PM -0800, John Hubbard wrote:
> On 12/7/22 12:31, Peter Xu wrote:
> > huge_pte_offset() is the main walker function for hugetlb pgtables. The
> > name is not really representing what it does, though.
> >
> > Instead of renaming it, introduce a wrapper function called hugetlb_walk()
> > which will use huge_pte_offset() inside. Assert on the locks when walking
> > the pgtable.
> >
> > Note, the vma lock assertion will be a no-op for private mappings.
> >
> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > ---
> > fs/hugetlbfs/inode.c | 4 +---
> > fs/userfaultfd.c | 6 ++----
> > include/linux/hugetlb.h | 39 +++++++++++++++++++++++++++++++++++++++
> > mm/hugetlb.c | 32 +++++++++++++-------------------
> > mm/page_vma_mapped.c | 2 +-
> > mm/pagewalk.c | 4 +---
> > 6 files changed, 57 insertions(+), 30 deletions(-)
> >
> > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> > index fdb16246f46e..48f1a8ad2243 100644
> > --- a/fs/hugetlbfs/inode.c
> > +++ b/fs/hugetlbfs/inode.c
> > @@ -388,9 +388,7 @@ static bool hugetlb_vma_maps_page(struct vm_area_struct *vma,
> > {
> > pte_t *ptep, pte;
> > - ptep = huge_pte_offset(vma->vm_mm, addr,
> > - huge_page_size(hstate_vma(vma)));
> > -
> > + ptep = hugetlb_walk(vma, addr, huge_page_size(hstate_vma(vma)));
> > if (!ptep)
> > return false;
> > diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> > index a602f008dde5..f31fe1a9f4c5 100644
> > --- a/fs/userfaultfd.c
> > +++ b/fs/userfaultfd.c
> > @@ -237,14 +237,12 @@ static inline bool userfaultfd_huge_must_wait(struct userfaultfd_ctx *ctx,
> > unsigned long flags,
> > unsigned long reason)
> > {
> > - struct mm_struct *mm = ctx->mm;
> > pte_t *ptep, pte;
> > bool ret = true;
> > - mmap_assert_locked(mm);
> > -
> > - ptep = huge_pte_offset(mm, address, vma_mmu_pagesize(vma));
> > + mmap_assert_locked(ctx->mm);
> > + ptep = hugetlb_walk(vma, address, vma_mmu_pagesize(vma));
> > if (!ptep)
> > goto out;
> > diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> > index 81efd9b9baa2..1c20cbbf3d22 100644
> > --- a/include/linux/hugetlb.h
> > +++ b/include/linux/hugetlb.h
> > @@ -2,6 +2,7 @@
> > #ifndef _LINUX_HUGETLB_H
> > #define _LINUX_HUGETLB_H
> > +#include <linux/mm.h>
> > #include <linux/mm_types.h>
> > #include <linux/mmdebug.h>
> > #include <linux/fs.h>
> > @@ -196,6 +197,11 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
> > * huge_pte_offset(): Walk the hugetlb pgtable until the last level PTE.
> > * Returns the pte_t* if found, or NULL if the address is not mapped.
> > *
> > + * IMPORTANT: we should normally not directly call this function, instead
> > + * this is only a common interface to implement arch-specific walker.
> > + * Please consider using the hugetlb_walk() helper to make sure of the
> > + * correct locking is satisfied.
>
> Or:
>
> "Please use hugetlb_walk() instead, because that will attempt to verify
> the locking for you."

Ok.

>
> > + *
> > * Since this function will walk all the pgtable pages (including not only
> > * high-level pgtable page, but also PUD entry that can be unshared
> > * concurrently for VM_SHARED), the caller of this function should be
> > @@ -1229,4 +1235,37 @@ bool want_pmd_share(struct vm_area_struct *vma, unsigned long addr);
> > #define flush_hugetlb_tlb_range(vma, addr, end) flush_tlb_range(vma, addr, end)
> > #endif
> > +static inline bool
> > +__vma_shareable_flags_pmd(struct vm_area_struct *vma)
> > +{
> > + return vma->vm_flags & (VM_MAYSHARE | VM_SHARED) &&
> > + vma->vm_private_data;
> > +}
> > +
> > +/*
> > + * Safe version of huge_pte_offset() to check the locks. See comments
> > + * above huge_pte_offset().
> > + */
>
> It is odd to say that functionA() is a safe version of functionB(), if the
> names are completely different.
>
> At this point, it is very clear that huge_pte_offset() should be renamed.
> I'd suggest something like one of these:
>
> __hugetlb_walk()
> hugetlb_walk_raw()

We can.

Not only because that's an arch api for years (didn't want to touch more
arch code unless necessary), but also since we have hugetlb_walk() that'll
be the future interface not huge_pte_offset().

Actually it's good when that's the only thing people can find from its name
when they want to have a huge pgtable walk. :)

So totally makes sense to do so, but I don't strongly feel like doing it in
this patchset if you're okay with it.

>
> > +static inline pte_t *
> > +hugetlb_walk(struct vm_area_struct *vma, unsigned long addr, unsigned long sz)
> > +{
> > +#if defined(CONFIG_HUGETLB_PAGE) && \
> > + defined(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) && defined(CONFIG_LOCKDEP)
> > + struct hugetlb_vma_lock *vma_lock = vma->vm_private_data;
> > +
> > + /*
> > + * If pmd sharing possible, locking needed to safely walk the
> > + * hugetlb pgtables. More information can be found at the comment
> > + * above huge_pte_offset() in the same file.
> > + *
> > + * NOTE: lockdep_is_held() is only defined with CONFIG_LOCKDEP.
> > + */
> > + if (__vma_shareable_flags_pmd(vma))
> > + WARN_ON_ONCE(!lockdep_is_held(&vma_lock->rw_sema) &&
> > + !lockdep_is_held(
> > + &vma->vm_file->f_mapping->i_mmap_rwsem));
> > +#endif
> > + return huge_pte_offset(vma->vm_mm, addr, sz);
> > +}
>
> Let's please not slice up C functions with ifdefs. Instead, stick to the
> standard approach of
>
> #ifdef X
> functionC()
> {
> ...implementation
> }
> #else
> functionC()
> {
> ...simpler or shorter or stub implementation
> }

Personally I like the slicing because it clearly tells what's the
difference with/without the macros defined.

Thanks,

--
Peter Xu