Re: [PATCH 21/23] hugetlb/userfaultfd: Only drop uffd-wp special pte if required

From: Mike Kravetz
Date: Fri Apr 23 2021 - 16:33:30 EST


On 3/22/21 5:50 PM, Peter Xu wrote:
> Just like what we've done with shmem uffd-wp special ptes, we shouldn't drop
> uffd-wp special swap pte for hugetlb too, only if we're going to unmap the
> whole vma, or we're punching a hole with safe locks held.
>
> For example, remove_inode_hugepages() is safe to drop uffd-wp ptes, because it
> has taken hugetlb fault mutex so that no concurrent page fault would trigger.
> While the call to hugetlb_vmdelete_list() in hugetlbfs_punch_hole() is not
> safe. That's why the previous call will be with ZAP_FLAG_DROP_FILE_UFFD_WP,
> while the latter one won't be able to.

This commit message is a bit confusing, but then the hugetlb hole punch
code path is itself a bit confusing. :)  How about something like this?

As with shmem uffd-wp special ptes, only drop the uffd-wp special swap
pte if unmapping an entire vma or synchronized such that faults cannot
race with the unmap operation. This requires passing zap_flags all the
way to the lowest level hugetlb unmap routine: __unmap_hugepage_range.

In general, unmap calls originating in hugetlbfs code will pass the
ZAP_FLAG_DROP_FILE_UFFD_WP flag as synchronization is in place to prevent
faults. The exception is hole punch, which will first unmap without any
synchronization. Later, when hole punch actually removes the page from
the file, it will check to see if there was a subsequent fault and, if
so, take the hugetlb fault mutex while unmapping again. This second
unmap will pass in ZAP_FLAG_DROP_FILE_UFFD_WP.
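
To make the intended semantics concrete, here is a rough sketch of what
the lowest level unmap routine ends up doing with the flag. Names like
zap_one_huge_pte() and pte_is_uffd_wp_special() are placeholders for
illustration only; the real logic lives in __unmap_hugepage_range(),
whose hunk is not quoted above.

static void zap_one_huge_pte(struct vm_area_struct *vma, unsigned long addr,
			     pte_t *ptep, unsigned long sz,
			     unsigned long zap_flags)
{
	pte_t pte = huge_ptep_get(ptep);

	/* Placeholder test for the uffd-wp special swap pte. */
	if (pte_is_uffd_wp_special(pte)) {
		/*
		 * Keep the marker unless the caller says dropping it is
		 * safe: either the whole vma is going away, or faults are
		 * excluded by the hugetlb fault mutex.
		 */
		if (zap_flags & ZAP_FLAG_DROP_FILE_UFFD_WP)
			huge_pte_clear(vma->vm_mm, addr, ptep, sz);
		return;
	}

	/* ... normal present/swap pte teardown elided ... */
}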

>
> Signed-off-by: Peter Xu <peterx@xxxxxxxxxx>
> ---
> fs/hugetlbfs/inode.c | 15 +++++++++------
> include/linux/hugetlb.h | 13 ++++++++-----
> mm/hugetlb.c | 27 +++++++++++++++++++++------
> mm/memory.c | 5 ++++-
> 4 files changed, 42 insertions(+), 18 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index d81f52b87bd7..5fe19e801a2b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -399,7 +399,8 @@ static void remove_huge_page(struct page *page)
> }
>
> static void
> -hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> +hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end,
> + unsigned long zap_flags)
> {
> struct vm_area_struct *vma;
>
> @@ -432,7 +433,7 @@ hugetlb_vmdelete_list(struct rb_root_cached *root, pgoff_t start, pgoff_t end)
> }
>
> unmap_hugepage_range(vma, vma->vm_start + v_offset, v_end,
> - NULL);
> + NULL, zap_flags);
> }
> }
>
> @@ -513,7 +514,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
> hugetlb_vmdelete_list(&mapping->i_mmap,
> index * pages_per_huge_page(h),
> - (index + 1) * pages_per_huge_page(h));
> + (index + 1) * pages_per_huge_page(h),
> + ZAP_FLAG_DROP_FILE_UFFD_WP);
> i_mmap_unlock_write(mapping);
> }
>
> @@ -579,7 +581,8 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> i_mmap_lock_write(mapping);
> i_size_write(inode, offset);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> - hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
> + hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0,
> + ZAP_FLAG_DROP_FILE_UFFD_WP);
> i_mmap_unlock_write(mapping);
> remove_inode_hugepages(inode, offset, LLONG_MAX);
> }
> @@ -612,8 +615,8 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> i_mmap_lock_write(mapping);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> hugetlb_vmdelete_list(&mapping->i_mmap,
> - hole_start >> PAGE_SHIFT,
> - hole_end >> PAGE_SHIFT);
> + hole_start >> PAGE_SHIFT,
> + hole_end >> PAGE_SHIFT, 0);
> i_mmap_unlock_write(mapping);
> remove_inode_hugepages(inode, hole_start, hole_end);
> inode_unlock(inode);
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 92710600596e..4047fa042782 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -121,14 +121,15 @@ long follow_hugetlb_page(struct mm_struct *, struct vm_area_struct *,
> unsigned long *, unsigned long *, long, unsigned int,
> int *);
> void unmap_hugepage_range(struct vm_area_struct *,
> - unsigned long, unsigned long, struct page *);
> + unsigned long, unsigned long, struct page *,
> + unsigned long);
> void __unmap_hugepage_range_final(struct mmu_gather *tlb,
> struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> - struct page *ref_page);
> + struct page *ref_page, unsigned long zap_flags);
> void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct *vma,
> unsigned long start, unsigned long end,
> - struct page *ref_page);
> + struct page *ref_page, unsigned long zap_flags);

Nothing introduced by your patch, but it seems __unmap_hugepage_range_final
does not need to be declared in the header and could be made static in hugetlb.c.
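
Assuming it really has no callers outside mm/hugetlb.c (worth double
checking the unmap path in mm/memory.c before doing this), the cleanup
would just be dropping the prototype from include/linux/hugetlb.h and
making the definition local, roughly:

/* mm/hugetlb.c */
static void __unmap_hugepage_range_final(struct mmu_gather *tlb,
					 struct vm_area_struct *vma,
					 unsigned long start, unsigned long end,
					 struct page *ref_page,
					 unsigned long zap_flags)
{
	/* existing body unchanged */
}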

Everything else looks good,

Reviewed-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
--
Mike Kravetz