Re: [PATCH 5/5] mm/thp: Split huge pmds/puds if they're pinned when fork()

From: Yang Shi
Date: Wed Sep 23 2020 - 12:08:05 EST


On Wed, Sep 23, 2020 at 8:26 AM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Tue, Sep 22, 2020 at 09:05:05AM -0300, Jason Gunthorpe wrote:
> > On Mon, Sep 21, 2020 at 05:20:31PM -0400, Peter Xu wrote:
> > > Pinned pages shouldn't be write-protected when fork() happens, because follow
> > > up copy-on-write on these pages could cause the pinned pages to be replaced by
> > > random newly allocated pages.
> > >
> > > For huge PMDs, we split the huge pmd if pinning is detected. So that future
> > > handling will be done by the PTE level (with our latest changes, each of the
> > > small pages will be copied). We can achieve this by let copy_huge_pmd() return
> > > -EAGAIN for pinned pages, so that we'll fallthrough in copy_pmd_range() and
> > > finally land the next copy_pte_range() call.
> > >
> > > Huge PUDs will be even more special - so far it does not support anonymous
> > > pages. But it can actually be done the same as the huge PMDs even if the split
> > > huge PUDs means to erase the PUD entries. It'll guarantee the follow up fault
> > > ins will remap the same pages in either parent/child later.
> > >
> > > This might not be the most efficient way, but it should be easy and clean
> > > enough. It should be fine, since we're tackling with a very rare case just to
> > > make sure userspaces that pinned some thps will still work even without
> > > MADV_DONTFORK and after they fork()ed.
> > >
> > > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > > mm/huge_memory.c | 26 ++++++++++++++++++++++++++
> > > 1 file changed, 26 insertions(+)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 7ff29cc3d55c..c40aac0ad87e 100644
> > > +++ b/mm/huge_memory.c
> > > @@ -1074,6 +1074,23 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > >
> > > src_page = pmd_page(pmd);
> > > VM_BUG_ON_PAGE(!PageHead(src_page), src_page);
> > > +
> > > + /*
> > > + * If this page is a potentially pinned page, split and retry the fault
> > > + * with smaller page size. Normally this should not happen because the
> > > + * userspace should use MADV_DONTFORK upon pinned regions. This is a
> > > + * best effort that the pinned pages won't be replaced by another
> > > + * random page during the coming copy-on-write.
> > > + */
> > > + if (unlikely(READ_ONCE(src_mm->has_pinned) &&
> > > + page_maybe_dma_pinned(src_page))) {
> > > + pte_free(dst_mm, pgtable);
> > > + spin_unlock(src_ptl);
> > > + spin_unlock(dst_ptl);
> > > + __split_huge_pmd(vma, src_pmd, addr, false, NULL);
> > > + return -EAGAIN;
> > > + }
> >
> > Not sure why, but the PMD stuff here is not calling is_cow_mapping()
> > before doing the write protect. Seems like it might be an existing
> > bug?
>
> IMHO it's not a bug, because splitting a huge pmd should always be safe.
>
> One thing I can think of that might be special here is when the pmd is
> anonymously mapped but also shared (shared, tmpfs thp, I think?), then here
> we'll also mark it as wrprotected even if we don't need to (or maybe we need it
> for some reason..). But again I think it's safe anyways - when page fault

For tmpfs map, the pmd split just clears the pmd entry without
reinstalling ptes (oppositely anonymous map would reinstall ptes). It
looks this patch intends to copy at pte level by splitting pmd. But
I'm afraid this may not work for tmpfs mappings.

> happens, wp_huge_pmd() should split it into smaller pages unconditionally. I
> just don't know whether it's the ideal way for the shared case. Andrea should
> definitely know it better (because it is there since the 1st day of thp).
>
> >
> > In any event, the has_pinned logic shouldn't be used without also
> > checking is_cow_mapping(), so it should be added to that test. Same
> > remarks for PUD
>
> I think the case mentioned above is also the special case here when we didn't
> check is_cow_mapping(). The major difference is whether we'll split the page
> right now, or postpone it until the next write to each mm. But I think, yes,
> maybe I should better still keep the is_cow_mapping() to be explicit.
>
> Thanks,
>
> --
> Peter Xu
>