Re: [PATCH RESEND v2 0/9] Merge arm64/riscv hugetlbfs contpte support
From: Alexandre Ghiti
Date: Thu Aug 01 2024 - 04:58:24 EST
Hi Ryan,
On Tue, Jul 2, 2024 at 9:51 AM Alexandre Ghiti <alex@xxxxxxxx> wrote:
>
> Hi Ryan,
>
> On 24/06/2024 10:00, Ryan Roberts wrote:
> > On 28/05/2024 09:07, Alexandre Ghiti wrote:
> >> Hi Ryan,
> >>
> >> On 12/05/2024 19:25, Alexandre Ghiti wrote:
> >>> Hi Ryan,
> >>>
> >>> On Fri, May 10, 2024 at 3:49 PM Ryan Roberts <ryan.roberts@xxxxxxx> wrote:
> >>>> On 08/05/2024 12:34, Alexandre Ghiti wrote:
> >>>>> This patchset intends to merge the contiguous ptes hugetlbfs implementation
> >>>>> of arm64 and riscv.
> >>>>>
> >>>>> Both arm64 and riscv support the use of contiguous ptes to map pages that
> >>>>> are larger than the default page table size, respectively called contpte
> >>>>> and svnapot.
> >>>>>
> >>>>> The riscv implementation differs from the arm64's in that the LSBs of the
> >>>>> pfn of a svnapot pte are used to store the size of the mapping, allowing
> >>>>> for future sizes to be added (for now only 64KB is supported). That's an
> >>>>> issue for the core mm code which expects to find the *real* pfn a pte points
> >>>>> to. Patch 1 fixes that by always returning svnapot ptes with the real pfn
> >>>>> and restores the size of the mapping when it is written to a page table.
> >>>>>
> >>>>> The following patches are just merges of the 2 different implementations
> >>>>> that currently exist in arm64 and riscv which are very similar. It paves
> >>>>> the way to the reuse of the recent contpte THP work by Ryan [1] to avoid
> >>>>> reimplementing the same in riscv.
> >>>> Hi Alexandre,
> >>>>
> >>>> I've skimmed through this series and the one that moves contpte. I can see there
> >>>> is definitely value in sharing the implementation, and the rough shape of things
> >>>> seems appropriate. I had some minor concerns about making it harder to implement
> >>>> potential future arm64 errata workarounds but on reflection, most of the
> >>>> now-shared code is really just wrapping the primitives that are still
> >>>> arch-specific.
> >>>>
> >>>> I'm going to need to spend proper time reviewing it to give detailed feedback,
> >>>> but I'll be out on paternity leave for 3 weeks from end of Monday at the latest.
> >>> Too bad, I expected to discuss that with you at LSF/MM...But congrats!
> >>> Hope your wife is fine :)
> >>>
> >>>> So realistically I won't be able to do the detailed review until at least the
> >>>> first week of June.
> > Hi Alexandre,
> >
> > Sorry for the radio silence. I'm back at work now and have some cycles to review
> > this. Did you ever post a new version based on the suggestions below?
>
>
> Unfortunately no, other things happened that took all my attention, sorry.
>
>
> >>>> Some high level thoughts:
> >>>>
> >>>> - huge_ptep_* functions could be working on different sized huge ptes - arm64
> >>>> supports contpte, pmd, contpmd and pud. Is keeping them in contpte.c
> >>>> appropriate?
> >>> Hmm indeed, I'll see what I can do.
> >>
> >> So I took a look at that. It amounts to doing the same as what we do for THP
> >> contptes, ie having both contpte-aware and "normal" APIs. Let's take for example
> >> huge_ptep_get(), below is what I get. To me it's not that bad, so I'll implement
> >> this unless there is strong opposition.
> > I'm not sure I've understood what you are going here... see below.
> >
> >>
> >> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> >> index f8efbc128446..869a9aae6c68 100644
> >> --- a/arch/arm64/include/asm/pgtable.h
> >> +++ b/arch/arm64/include/asm/pgtable.h
> >> @@ -1715,6 +1715,16 @@ static inline void clear_young_dirty_ptes(struct
> >> vm_area_struct *vma,
> >> contpte_clear_young_dirty_ptes(vma, addr, ptep, nr, flags);
> >> }
> >>
> >> +static inline pte_t huge_ptep_get(pte_t *ptep)
> >> +{
> >> + pte_t orig_pte = __ptep_get(ptep);
> >> +
> >> + if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> >> + return orig_pte;
> >> +
> >> + return contpte_huge_ptep_get(ptep);
> > A "huge pte" is not the same as a "cont pte". A huge pte is an abstract thing,
> > which maybe of a number of different sizes; on arm64 with 4K base pages, 64K,
> > 2M, 32M, 1G are supported. The 64K size is implemented using the PTE_CONT bit at
> > PTE level. 2M is a single PMD level block, 32M uses PMD_CONT at PMD level and 1G
> > is 1 PUD block. So I'm not sure it makes sense to tie this up with "contpte_"
> > functions?
> >
> >> +}
> >> +
> >> #else /* CONFIG_ARM64_CONTPTE */
> >>
> >> #define ptep_get __ptep_get
> >> @@ -1736,6 +1746,8 @@ static inline void clear_young_dirty_ptes(struct
> >> vm_area_struct *vma,
> >> #define ptep_set_access_flags __ptep_set_access_flags
> >> #define clear_young_dirty_ptes __clear_young_dirty_ptes
> >>
> >> +#define huge_ptep_get __ptep_get
> > I don't quite understand the logic here. huge ptes are needed for hugetlb so
> > their definition needs to be tied to that, not to ARM64_CONTPTE, which is an
> > independent feature.
> >
> >> +
> >> #endif /* CONFIG_ARM64_CONTPTE */
> >>
> >> #endif /* !__ASSEMBLY__ */
> >> diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
> >> index 3f09ac73cce3..aa0ee3f02226 100644
> >> --- a/arch/arm64/mm/hugetlbpage.c
> >> +++ b/arch/arm64/mm/hugetlbpage.c
> >> @@ -127,28 +127,6 @@ static inline int num_contig_ptes(unsigned long size,
> >> size_t *pgsize)
> >> return contig_ptes;
> >> }
> >>
> >> -pte_t huge_ptep_get(pte_t *ptep)
> >> -{
> >> - int ncontig, i;
> >> - size_t pgsize;
> >> - pte_t orig_pte = __ptep_get(ptep);
> >> -
> >> - if (!pte_present(orig_pte) || !pte_cont(orig_pte))
> >> - return orig_pte;
> >> -
> >> - ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> >> - for (i = 0; i < ncontig; i++, ptep++) {
> >> - pte_t pte = __ptep_get(ptep);
> >> -
> >> - if (pte_dirty(pte))
> >> - orig_pte = pte_mkdirty(orig_pte);
> >> -
> >> - if (pte_young(pte))
> >> - orig_pte = pte_mkyoung(orig_pte);
> >> - }
> >> - return orig_pte;
> >> -}
> >> -
> >> /*
> >> * Changing some bits of contiguous entries requires us to follow a
> >> * Break-Before-Make approach, breaking the whole contiguous set
> >> diff --git a/mm/contpte.c b/mm/contpte.c
> >> new file mode 100644
> >> index 000000000000..4e742cf00b6f
> >> --- /dev/null
> >> +++ b/mm/contpte.c
> >> @@ -0,0 +1,17 @@
> >> +pte_t contpte_huge_ptep_get(pte_t *ptep)
> >> +{
> >> + int ncontig, i;
> >> + size_t pgsize;
> >> +
> >> + ncontig = num_contig_ptes(page_size(pte_page(orig_pte)), &pgsize);
> >> + for (i = 0; i < ncontig; i++, ptep++) {
> >> + pte_t pte = __ptep_get(ptep);
> >> +
> >> + if (pte_dirty(pte))
> >> + orig_pte = pte_mkdirty(orig_pte);
> >> +
> >> + if (pte_young(pte))
> >> + orig_pte = pte_mkyoung(orig_pte);
> >> + }
> >> + return orig_pte;
> >> +}
> > I guess your observation is that contpte_ and hugepte_ code looks similar so it
> > shold be grouped? I think if we can get some actual reuse that might make sense,
> > but as implemented, this function is completely separate from
> > contpte_ptep_get(). I wonder if its simpler just to have contpte.c for contpte_
> > and hugepte_.c for hugepte_ then they can be included in the build independently
> > based on arch/core Kconfigs (e.g. CONFIG_HUGETLB_PAGE vs CONFIG_ARM64_CONTPTE).
>
>
> Yes, you're right, this was just rambling :)
>
>
> >
> >>>> Perhaps it's better to keep huge_pte and contpte separate? Also, it
> >>>> only works on arm64 because we can get away with calling the lower-level pte
> >>>> functions even when the huge_pte is actually a contpmd/pmd/pud, because the
> >>>> format is the same. That might present challenges to other arches if the format
> >>>> is different?
> >>> Yes, but I think that if that happens, we could get away with it by
> >>> choosing the right function depending on the size of the mapping?
> >>>
> >>>> - It might be easier to review if the arm64 stuff is first moved (without
> >>>> changes) then modified to make it suitable for riscv, then for riscv to be
> >>>> hooked up. At the moment I'm trying to follow all 3 parts per-function.
> >>> Ok, let me give it a try during your paternity leave!
> > Review would certainly be easier with this approach!
>
>
> I'll do my best to do that soon.
So I finally have time to rework this and I tested moving all the
arm64 functions first and then adapting them to riscv. But I can't
modify a function at a time as it fails to build for riscv, since the
not-yet-adapted functions are arm64 specific. So I'd have to come with
a big monolithic patch that changes everything at once, which I think
will not help the review process.
So I'll keep the same format and should be back soon with the new version.
Thanks,
Alex
>
> Hope everything went well for you.
>
> Thanks,
>
> Alex
>
>
> >
> > Thanks,
> > Ryan
> >
> >>>> Thanks,
> >>>> Ryan
> >>> Thanks,
> >>>
> >>> Alex
> >>>
> >>>>> This patchset was tested by running the libhugetlbfs testsuite with 64KB
> >>>>> and 2MB pages on both architectures (on a 4KB base page size arm64 kernel).
> >>>>>
> >>>>> [1]
> >>>>> https://lore.kernel.org/linux-arm-kernel/20240215103205.2607016-1-ryan.roberts@xxxxxxx/
> >>>>>
> >>>>> Changes in v2:
> >>>>> - Rebase on top of 6.9-rc3
> >>>>>
> >>>>> Alexandre Ghiti (9):
> >>>>> riscv: Restore the pfn in a NAPOT pte when manipulated by core mm code
> >>>>> riscv: Safely remove huge_pte_offset() when manipulating NAPOT ptes
> >>>>> mm: Use common huge_ptep_get() function for riscv/arm64
> >>>>> mm: Use common set_huge_pte_at() function for riscv/arm64
> >>>>> mm: Use common huge_pte_clear() function for riscv/arm64
> >>>>> mm: Use common huge_ptep_get_and_clear() function for riscv/arm64
> >>>>> mm: Use common huge_ptep_set_access_flags() function for riscv/arm64
> >>>>> mm: Use common huge_ptep_set_wrprotect() function for riscv/arm64
> >>>>> mm: Use common huge_ptep_clear_flush() function for riscv/arm64
> >>>>>
> >>>>> arch/arm64/Kconfig | 1 +
> >>>>> arch/arm64/include/asm/pgtable.h | 56 +++++-
> >>>>> arch/arm64/mm/hugetlbpage.c | 291 +---------------------------
> >>>>> arch/riscv/Kconfig | 1 +
> >>>>> arch/riscv/include/asm/hugetlb.h | 2 +-
> >>>>> arch/riscv/include/asm/pgtable-64.h | 11 ++
> >>>>> arch/riscv/include/asm/pgtable.h | 153 +++++++++++++--
> >>>>> arch/riscv/mm/hugetlbpage.c | 227 ----------------------
> >>>>> arch/riscv/mm/pgtable.c | 6 +-
> >>>>> mm/Kconfig | 3 +
> >>>>> mm/Makefile | 1 +
> >>>>> mm/contpte.c | 272 ++++++++++++++++++++++++++
> >>>>> 12 files changed, 480 insertions(+), 544 deletions(-)
> >>>>> create mode 100644 mm/contpte.c
> >>>>>
> >>> _______________________________________________
> >>> linux-riscv mailing list
> >>> linux-riscv@xxxxxxxxxxxxxxxxxxx
> >>> http://lists.infradead.org/mailman/listinfo/linux-riscv