Re: [PATCH v2 07/19] mm/fork: Accept huge pfnmap entries

From: Peter Xu
Date: Tue Sep 10 2024 - 08:16:30 EST


On Tue, Sep 10, 2024 at 10:52:01AM +0800, Yan Zhao wrote:
> Hi Peter,

Hi, Yan,

>
> Not sure if I missed anything.
>
> It looks that before this patch, pmd/pud are alawys write protected without
> checking "is_cow_mapping(vma->vm_flags) && pud_write(pud)". pud_wrprotect()
> clears dirty bit by moving the dirty value to the software bit.
>
> And I have a question that why previously pmd/pud are always write protected.

IIUC this is a separate question - the move of dirty bit in pud_wrprotect()
is to avoid wrongly creating shadow stack mappings. In our discussion I
think that's an extra complexity and can be put aside; the dirty bit will
get recovered in pud_clear_saveddirty() later, so it's not the same as
pud_mkclean().

AFAIU pmd/pud paths don't consider is_cow_mapping() because normally we
will not duplicate pgtables in fork() for most of shared file mappings
(!CoW). Please refer to vma_needs_copy(), and the comment before returning
false at last. I think it's not strictly is_cow_mapping(), as we're
checking anon_vma there, however it's mostly it, just to also cover
MAP_PRIVATE on file mappings too when there's no CoW happened (as if CoW
happened then anon_vma will appear already).

There're some outliers, e.g. userfault protected, or pfnmaps/mixedmaps.
Userfault & mixedmap are not involved in this series at all, so let's
discuss pfnmaps.

It means, fork() can still copy pgtable for pfnmap vmas, and it's relevant
to this series, because before this series pfnmap only exists in pte level,
hence IMO the is_cow_mapping() must exist for pte level as you described,
because it needs to properly take care of those. Note that in the pte
processing it also checks pte_write() to make sure it's a COWed page, not a
RO page cache / pfnmap / ..., for example.

Meanwhile, since pfnmap won't appear in pmd/pud, I think it's fair that
pmd/pud assumes when seeing a huge mapping it must be MAP_PRIVATE otherwise
the whole copy_page_range() could be already skipped. IOW I think they
only need to process COWed pages here, and those pages require write bit
removed in both parent and child when fork().

After this series, pfnmaps can appear in the form of pmd/pud, then the
previous assumption will stop holding true, as we'll still copy pfnmaps
during fork() always. My guessing of the reason is because most of the
drivers map pfnmap vmas only during mmap(), it means there can normally
have no fault() handler at all for those pfns.

In this case, we'll need to also identify whether the page is COWed, using
the newly added "is_cow_mapping() && pxx_write()" in this series (added
to pud path, while for pmd path I used a WARN_ON_ONCE instead).

If we don't do that, it means e.g. for a VM_SHARED pfnmap vma, after fork()
we'll wrongly observe write protected entries. Here the change will make
sure VM_SHARED can properly persist the write bits on pmds/puds.

Hope that explains.

Thanks,

--
Peter Xu