Re: [PATCH] hugetlb: unshare some PMDs when splitting VMAs
From: James Houghton
Date: Wed Jan 04 2023 - 18:13:35 EST
On Wed, Jan 4, 2023 at 8:03 PM Peter Xu <peterx@xxxxxxxxxx> wrote:
>
> On Wed, Jan 04, 2023 at 07:10:11PM +0000, James Houghton wrote:
> > > > I'll see if I can confirm that this is indeed possible and send a
> > > > repro if it is.
> > >
> > > I think your analysis above is correct. The key being the failure to unshare
> > > in the non-PUD_SIZE vma after the split.
> >
> > I do indeed hit the WARN_ON_ONCE (repro attached), and the MADV wasn't
> > even needed (the UFFDIO_REGISTER does the VMA split before "unsharing
> > all PMDs"). With the fix, we avoid the WARN_ON_ONCE, but the behavior
> > is still incorrect: I expect the address range to be write-protected,
> > but it isn't.
> >
> > The reason why is that hugetlb_change_protection uses huge_pte_offset,
> > even if it's being called for a UFFDIO_WRITEPROTECT with
> > UFFDIO_WRITEPROTECT_MODE_WP. In that particular case, I'm pretty sure
> > we should be using huge_pte_alloc, but even so, it's not trivial to
> > get an allocation failure back up to userspace. The non-hugetlb
> > implementation of UFFDIO_WRITEPROTECT seems to also have this problem.
> >
> > Peter, what do you think?
>
> Indeed. Thanks for spotting that, James.
>
> Non-hugetlb should be fine with having empty pgtable entries. Anon doesn't
> need to care about no-pgtable-populated ranges so far. Shmem does it with a
> few change_prepare() calls to populate the entries so the markers can be
> installed later on.
Ah ok! :)
>
> However I think the fault handling is still not well handled as you pointed
> out even for shmem: that's the path I probably never triggered myself yet
> before and the code stayed there since a very early version:
>
> #define change_pmd_prepare(vma, pmd, cp_flags) \
> do { \
> if (unlikely(uffd_wp_protect_file(vma, cp_flags))) { \
> if (WARN_ON_ONCE(pte_alloc(vma->vm_mm, pmd))) \
> break; \
> } \
> } while (0)
>
> I think a better thing we can do here (instead of warning and stop the
> UFFDIO_WRITEPROTECT at the current stage) is returning with -ENOMEM
> properly so the user can know the error. We'll need to touch the stacks up
> to uffd_wp_range() as it's the only one that can trigger the -ENOMEM so
> far, so as to not ignore retval from change_protection().
>
> Meanwhile, I'd also wonder whether we should call pagefault_out_of_memory()
> because it should be the same as when pgtable allocation failure happens in
> page faults, we may want to OOM already. I can take care of hugetlb part
> too along the way.
I might be misunderstanding, the only case where
hugetlb_change_protection() would *need* to allocate is when it is
called from UFFDIO_WRITEPROTECT, not while handling a #pf. So I don't
think any calls to pagefault_out_of_memory() need to be added.
>
> Man page of UFFDIO_WRITEPROTECT may need a fixup too to introduce -ENOMEM.
>
> I can quickly prepare some patches for this, and hopefully it doesn't need
> to block the current fix on split.
I don't think it should block this splitting fix. I'll send another
version of this fix soon.
>
> Any thoughts?
>
> >
> > >
> > > To me, the fact it was somewhat difficult to come up with this scenario is an
> > > argument what we should just unshare at split time as you propose. Who
> > > knows what other issues may exist.
> > >
> > > > 60dfaad65a ("mm/hugetlb: allow uffd wr-protect none ptes") is the
> > > > commit that introduced the WARN_ON_ONCE; perhaps it's a good choice
> > > > for a Fixes: tag (if above is indeed true).
> > >
> > > If the key issue in your above scenario is indeed the failure of
> > > hugetlb_unshare_all_pmds in the non-PUD_SIZE vma, then perhaps we tag?
> > >
> > > 6dfeaff93be1 ("hugetlb/userfaultfd: unshare all pmds for hugetlbfs when
> > > register wp")
> >
> > SGTM. Thanks Mike.
>
> Looks good here too.
Thanks, Peter!