Re: [PATCH 0/2] mm: rmap: merge HugeTLB mapcount logic with THPs

From: Peter Xu
Date: Thu Mar 09 2023 - 14:32:24 EST


On Thu, Mar 09, 2023 at 10:05:12AM -0800, James Houghton wrote:
> On Wed, Mar 8, 2023 at 2:10 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
> >
> > On Mon, Mar 06, 2023 at 11:00:02PM +0000, James Houghton wrote:
> > > HugeTLB pages may soon support being mapped with PTEs. To allow for this
> > > case, merge HugeTLB's mapcount scheme with THP's.
> > >
> > > The first patch of this series comes from the HugeTLB high-granularity
> > > mapping series[1], though with some updates, as the original version
> > > was buggy[2] and incomplete.
> > >
> > > I am sending this change as part of this smaller series in hopes that it
> > > can be more thoroughly scrutinized.
> > >
> > > I haven't run any THP performance tests with this series applied.
> > > HugeTLB pages don't currently support being mapped with
> > > `compound=false`, but this mapcount scheme will make collapsing
> > > compound=false mappings in HugeTLB pages quite slow. This can be
> > > optimized with future patches (likely by taking advantage of HugeTLB's
> > > alignment guarantees).
> > >
> > > Matthew Wilcox is working on a mapcounting scheme[3] that will avoid
> > > the use of each subpage's mapcount. If this series is applied, Matthew's
> > > new scheme will automatically apply to HugeTLB pages.
> >
> > Is this the plan?
> >
> > I may have not followed closely on the latest development of Matthew's
> > idea. The thing is if the design requires ptes being installed / removed
> > at the same time for the whole folio, then it may not work directly for HGM
> > if HGM wants to support at least postcopy, iiuc, because if we install the
> > whole folio ptes at the same time it seems to beat the whole purpose of
> > having HGM..
>
> My understanding is that it doesn't *require* all the PTEs in a folio
> to be mapped at the same time. I don't see how it possibly could,
> given that UFFDIO_CONTINUE exists (which can already create PTE-mapped
> THPs today). It would be faster to populate all the PTEs at the same
> time (you would only need to traverse the page table once for the
> entire group to see if you should be incrementing mapcount).
>
> Though, with respect to unmapping, if PTEs aren't all unmapped at the
> same time, then you could end up with a case where mapcount is still
> incremented but nothing is really mapped. I'm not really sure what
> should be done there, but this problem applies to PTE-mapped THPs the
> same way that it applies to HGMed HugeTLB pages.
>
> > The patch (especially patch 1) looks good. So it's a pure question just to
> > make sure we're on the same page. IIUC your other mapcount proposal may
> > work, but it still needs to be able to take care of ptes in less-than-folio
> > sizes whatever it'll look like at last.
>
> By my "other mapcount proposal", I assume you mean the "using the
> PAGE_SPECIAL bit to track if mapcount has been incremented or not". It
> really only serves as an optimization for Matthew's scheme (see below
> [2] for some more thoughts), and it doesn't have to only apply to
> HugeTLB.
>
> I originally thought[1] that Matthew's scheme would be really painful
> for postcopy for HGM without this optimization, but it's actually not
> so bad. Let's assume the worst case, that we're UFFDIO_CONTINUEing
> from the end to the beginning, like in [1]:
>
> First CONTINUE: pvmw finds an empty PUD, so quickly returns false.
> Second CONTINUE: pvmw finds 511 empty PMDs, then finds 511 empty PTEs,
> then finds a present PTE (from the first CONTINUE).
> Third CONTINUE: pvmw finds 511 empty PMDs, then finds 510 empty PTEs.
> ...
> 514th CONTINUE: pvmw finds 510 empty PMDs, then finds 511 empty PTEs.
>
> So it'll be slow, but it won't have to check 262k empty PTEs per
> CONTINUE (though you could make this possible with MADV_DONTNEED).
> Even with an HGM implementation that only allows PTE-mapping of
> HugeTLB pages, it should still behave just like this, too.
>
> > A trivial comment on patch 2 since we're at it: does "a future plan on some
> > arch to support 512GB huge page" justify itself? It would be better
> > justified, IMHO, when that support is added (and decided to use HGM)?
>
> That's fine with me. I'm happy to drop that patch.
>
> > What I feel like is missing (rather than patch 2 itself) is some guard to
> > make sure thp mapcountings will not be abused with new hugetlb sizes
> > coming.
> >
> > How about another BUG_ON() squashed into patch 1 (probably somewhere in
> > page_add_file|anon_rmap()) to make sure folio_size() is always smaller than
> > COMPOUND_MAPPED / 2)?
>
> Sure, I can add that.
>
> Thanks, Peter!
>
> - James
>
> [1]: https://lore.kernel.org/linux-mm/CADrL8HUrEgt+1qAtEsOHuQeA+WWnggGfLj8_nqHF0k-pqPi52w@xxxxxxxxxxxxxx/
>
> [2]: Some details on what the optimization might look like:
>
> So an excerpt of Matthew's scheme would look something like this:
>
> /* if we're mapping < folio_nr_pages(folio) worth of PTEs. */
> if (!folio_has_ptes(folio, vma))
> atomic_inc(folio->_mapcount);
>
> where folio_has_ptes() is defined like:
>
> if (!page_vma_mapped_walk(...))
> return false;
> page_vma_mapped_walk_done(...);
> return true;
>
> You might be able to optimize folio_has_ptes() with a block like this
> at the beginning:
>
> if (folio_is_naturally_aligned(folio, vma)) {
> /* optimization for naturally-aligned folios. */
> if (folio_test_hugetlb(folio)) {
> /* check hstate-level PTE, and do a similar check as below. */
> }
> /* for naturally-aligned THPs: */
> pmdp = mm_find_pmd(...); /* or just pass it in. */
> pmd = READ_ONCE(*pmdp);
> BUG_ON(!pmd_present(pmd) || pmd_leaf(pmd));
> if (pmd_special(pmd))
> return true;
> /* we already hold the PTL for the PTE. */
> ptl = pmd_lock(mm, pmdp);
> /* test and set pmd_special */
> pmd_unlock(ptl)
> return if_we_set_pmd_special;
> }
>
> (pmd_special() doesn't currently exist.) If HugeTLB walking code can
> be merged with generic mm, then HugeTLB wouldn't have a special case
> at all here.

I see what you mean now, thanks. That looks fine. I just suspect the
pte_special trick will still be needed if this will start to apply to HGM,
as it seems to not suite perfectly with a large folio size, still. The
MADV_DONTNEED worst case of having it loop over ~folio_size() times of none
pte is still possible.

--
Peter Xu