Re: [PATCH RFC v2 1/1] hugetlbfs: use i_mmap_rwsem for pmd sharing and truncate/fault sync
From: Naoya Horiguchi
Date: Thu Oct 25 2018 - 20:44:02 EST
Hi Mike,
On Tue, Oct 23, 2018 at 09:50:53PM -0700, Mike Kravetz wrote:
> hugetlbfs does not correctly handle page faults racing with truncation.
> In addition, shared pmds can cause additional issues.
>
> Without pmd sharing, issues can occur as follows:
> A hugetlbfs file is mmapped (MAP_SHARED) with a size of 4 pages. At
> mmap time, 4 huge pages are reserved for the file/mapping. So,
> the global reserve count is 4. In addition, since this is a shared
> mapping, an entry for 4 pages is added to the file's reserve map.
> The first 3 of the 4 pages are faulted into the file. As a result,
> the global reserve count is now 1.
>
> Task A starts to fault in the last page (routines hugetlb_fault,
> hugetlb_no_page). It allocates a huge page (alloc_huge_page).
> The reserve map indicates there is a reserved page, so this is
> used and the global reserve count goes to 0.
>
> Now, task B truncates the file to size 0. It starts by setting
> the inode size to 0 (hugetlb_vmtruncate). It then unmaps all mappings
> of the file (hugetlb_vmdelete_list). Since task A's page table
> lock is not held at the time, truncation is not blocked. Truncation
> removes the 3 pages from the file (remove_inode_hugepages). When
> cleaning up the reserved pages (hugetlb_unreserve_pages), it notices
> the reserve map was for 4 pages. However, it has only freed 3 pages.
> So it assumes there is still (4 - 3) = 1 reserved page. It then
> decrements the global reserve count by 1, and the count goes negative.
>
> Task A then continues the page fault process and adds its newly
> acquired page to the page cache. Note that the index of this page
> is beyond the size of the truncated file (0). The page fault process
> then notices the file has been truncated and exits. However, the
> page is left in the cache associated with the file.
>
> Now, if the file is immediately deleted, the truncate code runs again.
> It will find and free the one page associated with the file. When
> cleaning up reserves, it notices the reserve map is empty. Yet, one
> page was freed. So, the global reserve count is decremented by (0 - 1) = -1.
> This returns the global count to 0 as it should be. But, it is
> possible for someone else to mmap this file/range before it is deleted.
> If this happens, a reserve map entry for the allocated page is created
> and the reserved page is forever leaked.
>
> With pmd sharing, the situation is even worse. Consider the following:
> A task processes a page fault on a shared hugetlbfs file and calls
> huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
> shared pmd.
>
> Now, anopther task truncates the hugetlbfs file. As part of truncation,
> it unmaps everyone who has the file mapped. If a task has a shared pmd
> in this range, huge_pmd_unshhare will be called. If this is not the last
(sorry, nitpicking ...) there are a couple of typos here ("anopther" and "unshhare").
> user sharing the pmd, huge_pmd_unshare will clear the pud pointing to the
> pmd. For the task in the middle of the page fault, the ptep returned by
> huge_pte_alloc points to another task's page table or worse. This leads
> to bad things such as incorrect page map/reference counts or invalid
> memory references.
>
> i_mmap_rwsem is currently used for pmd sharing synchronization. It is also
> held during unmap and whenever a call to huge_pmd_unshare is possible. It
> is only acquired in write mode. Expand and modify the use of i_mmap_rwsem
> as follows:
> - i_mmap_rwsem is held in write mode for the duration of truncate
> processing.
> - i_mmap_rwsem is held in write mode whenever huge_pmd_share is called.
I guess you mean huge_pmd_unshare here, right?
> - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
> Today that is only via huge_pte_alloc.
> - i_mmap_rwsem is held in read mode after huge_pte_alloc, until the caller
> is finished with the returned ptep.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
> ---
> fs/hugetlbfs/inode.c | 21 ++++++++++----
> mm/hugetlb.c | 65 +++++++++++++++++++++++++++++++++-----------
> mm/rmap.c | 10 +++++++
> mm/userfaultfd.c | 11 ++++++--
> 4 files changed, 84 insertions(+), 23 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 32920a10100e..6ee97622a231 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -426,10 +426,16 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> u32 hash;
>
> index = page->index;
> - hash = hugetlb_fault_mutex_hash(h, current->mm,
> + /*
> + * No need to take fault mutex for truncation as we
> + * are synchronized via i_mmap_rwsem.
> + */
> + if (!truncate_op) {
> + hash = hugetlb_fault_mutex_hash(h, current->mm,
> &pseudo_vma,
> mapping, index, 0);
> - mutex_lock(&hugetlb_fault_mutex_table[hash]);
> + mutex_lock(&hugetlb_fault_mutex_table[hash]);
> + }
>
> /*
> * If page is mapped, it was faulted in after being
> @@ -470,7 +476,8 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
> }
>
> unlock_page(page);
> - mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> + if (!truncate_op)
> + mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> }
> huge_pagevec_release(&pvec);
> cond_resched();
> @@ -505,8 +512,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
> i_mmap_lock_write(mapping);
> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
> hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
> - i_mmap_unlock_write(mapping);
> remove_inode_hugepages(inode, offset, LLONG_MAX);
> + i_mmap_unlock_write(mapping);
I just have the impression that hugetlbfs_punch_hole() could have a
similar race, and extending the lock range there could be an improvement,
although I might be missing something as always.
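For example, something like the following is what I have in mind (just a
rough sketch based on my reading of the current hugetlbfs_punch_hole(),
with the sealing check and error paths omitted), i.e. keep i_mmap_rwsem
held across remove_inode_hugepages() just like in hugetlb_vmtruncate():

	i_mmap_lock_write(mapping);
	if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
		hugetlb_vmdelete_list(&mapping->i_mmap,
					hole_start >> PAGE_SHIFT,
					hole_end >> PAGE_SHIFT);
	/* hold i_mmap_rwsem so racing faults cannot repopulate the hole */
	remove_inode_hugepages(inode, hole_start, hole_end);
	i_mmap_unlock_write(mapping);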
> return 0;
> }
>
> @@ -624,7 +631,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
> /* addr is the offset within the file (zero based) */
> addr = index * hpage_size;
>
> - /* mutex taken here, fault path and hole punch */
> + /*
> + * fault mutex taken here, protects against fault path
> + * and hole punch. inode_lock previously taken protects
> + * against truncation.
> + */
> hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
> index, addr);
> mutex_lock(&hugetlb_fault_mutex_table[hash]);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7b5c0ad9a6bd..e9da3eee262f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3252,18 +3252,33 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>
> for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
> spinlock_t *src_ptl, *dst_ptl;
> + struct vm_area_struct *dst_vma;
> + struct address_space *mapping;
> +
> src_pte = huge_pte_offset(src, addr, sz);
> if (!src_pte)
> continue;
> +
> + /*
> + * i_mmap_rwsem must be held to call huge_pte_alloc.
> + * Continue to hold until finished with dst_pte, otherwise
> + * it could go away if part of a shared pmd.
> + */
> + dst_vma = find_vma(dst, addr);
> + mapping = dst_vma->vm_file->f_mapping;
If vma->vm_file->f_mapping gives the same mapping, could you omit the find_vma() call?
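Just to illustrate what I mean (a sketch only, and it assumes the vma
passed to copy_hugetlb_page_range() is backed by the same hugetlbfs file
as dst_vma, which should hold in the fork path):

		/* src and dst map the same file, so vma's mapping suffices */
		mapping = vma->vm_file->f_mapping;
		i_mmap_lock_read(mapping);
		dst_pte = huge_pte_alloc(dst, addr, sz);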
> + i_mmap_lock_read(mapping);
> dst_pte = huge_pte_alloc(dst, addr, sz);
> if (!dst_pte) {
> + i_mmap_unlock_read(mapping);
> ret = -ENOMEM;
> break;
> }
>
> /* If the pagetables are shared don't copy or take references */
> - if (dst_pte == src_pte)
> + if (dst_pte == src_pte) {
> + i_mmap_unlock_read(mapping);
> continue;
> + }
>
> dst_ptl = huge_pte_lock(h, dst, dst_pte);
> src_ptl = huge_pte_lockptr(h, src, src_pte);
[...]
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 1e79fac3186b..db49e734dda8 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1347,6 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> bool ret = true;
> unsigned long start = address, end;
> enum ttu_flags flags = (enum ttu_flags)arg;
> + bool pmd_sharing_possible = false;
>
> /* munlock has nothing to gain from examining un-locked vmas */
> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
> @@ -1376,8 +1377,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
> * accordingly.
> */
> adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> + if ((end - start) > (PAGE_SIZE << compound_order(page)))
> + pmd_sharing_possible = true;
Maybe a similar check is already done inside
adjust_range_if_pmd_sharing_possible(), as the function name suggests, so
would it make more sense to get this bool value via its return value?
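For instance, something like this (a hypothetical sketch, assuming no
existing caller relies on the current void return):

	/*
	 * Suppose adjust_range_if_pmd_sharing_possible() returned true
	 * whenever it actually extends [start, end) to cover a shareable
	 * pmd range; then the size comparison above would not be needed.
	 */
	pmd_sharing_possible =
		adjust_range_if_pmd_sharing_possible(vma, &start, &end);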
Thanks,
Naoya Horiguchi