Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries
From: Peter Xu
Date: Tue Sep 03 2024 - 17:24:03 EST
On Mon, Sep 02, 2024 at 03:58:38PM +0800, Yan Zhao wrote:
> On Mon, Aug 26, 2024 at 04:43:41PM -0400, Peter Xu wrote:
> > Teach the fork code to properly copy pfnmaps at the pmd/pud levels. Pud is
> > much easier: the write bit needs to be persisted for writable and shared
> > pud mappings like PFNMAP ones, otherwise a follow-up write in either the
> > parent or the child process will trigger a write fault.
> >
> > Do the same for pmd level.
> >
> > Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> > ---
> > mm/huge_memory.c | 29 ++++++++++++++++++++++++++---
> > 1 file changed, 26 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e2c314f631f3..15418ffdd377 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1559,6 +1559,24 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > pgtable_t pgtable = NULL;
> > int ret = -ENOMEM;
> >
> > + pmd = pmdp_get_lockless(src_pmd);
> > + if (unlikely(pmd_special(pmd))) {
> > + dst_ptl = pmd_lock(dst_mm, dst_pmd);
> > + src_ptl = pmd_lockptr(src_mm, src_pmd);
> > + spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
> > + /*
> > + * No need to recheck the pmd, it can't change with write
> > + * mmap lock held here.
> > + *
> > + * Meanwhile, making sure it's not a CoW VMA with writable
> > + * mapping, otherwise it means either the anon page wrongly
> > + * applied special bit, or we made the PRIVATE mapping be
> > + * able to wrongly write to the backend MMIO.
> > + */
> > + VM_WARN_ON_ONCE(is_cow_mapping(src_vma->vm_flags) && pmd_write(pmd));
> > + goto set_pmd;
> > + }
> > +
> > /* Skip if can be re-fill on fault */
> > if (!vma_is_anonymous(dst_vma))
> > return 0;
> > @@ -1640,7 +1658,9 @@ int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > pmdp_set_wrprotect(src_mm, addr, src_pmd);
> > if (!userfaultfd_wp(dst_vma))
> > pmd = pmd_clear_uffd_wp(pmd);
> > - pmd = pmd_mkold(pmd_wrprotect(pmd));
> > + pmd = pmd_wrprotect(pmd);
> > +set_pmd:
> > + pmd = pmd_mkold(pmd);
> > set_pmd_at(dst_mm, addr, dst_pmd, pmd);
> >
> > ret = 0;
> > @@ -1686,8 +1706,11 @@ int copy_huge_pud(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > * TODO: once we support anonymous pages, use
> > * folio_try_dup_anon_rmap_*() and split if duplicating fails.
> > */
> > - pudp_set_wrprotect(src_mm, addr, src_pud);
> > - pud = pud_mkold(pud_wrprotect(pud));
> > + if (is_cow_mapping(vma->vm_flags) && pud_write(pud)) {
> > + pudp_set_wrprotect(src_mm, addr, src_pud);
> > + pud = pud_wrprotect(pud);
> > + }
> Do we need logic to clear the dirty bit in the child, as is done in
> __copy_present_ptes()? (And also for the pmd case.)
>
> e.g.
> if (vma->vm_flags & VM_SHARED)
> pud = pud_mkclean(pud);
Yeah, good question. I remember thinking about that when initially working
on these lines, but I forgot the details; maybe I simply tried to stick
with the current code base, as the dirty bit used to be kept in the child
here anyway.
I'd expect there to be only performance differences, but it still sounds
like I'd better leave that to whoever knows the implications best, and
draft it as a separate patch only when needed.
Thanks,
--
Peter Xu