Re: [PATCH RFC v2 1/1] hugetlbfs: use i_mmap_rwsem for pmd sharing and truncate/fault sync

From: Mike Kravetz
Date: Tue Oct 30 2018 - 17:55:06 EST


On 10/25/18 5:42 PM, Naoya Horiguchi wrote:
> Hi Mike,
>
> On Tue, Oct 23, 2018 at 09:50:53PM -0700, Mike Kravetz wrote:
>> Now, anopther task truncates the hugetlbfs file. As part of truncation,
>> it unmaps everyone who has the file mapped. If a task has a shared pmd
>> in this range, huge_pmd_unshhare will be called. If this is not the last
>
> (sorry, nitpicking ..) a few typos ("anophter" and "unshhare").

Hi Naoya,

Thanks for looking at the patch. I put this together somewhat quickly before
traveling and unfortunately made several typos. Wanted to provide adequate
documentation to help understand the changes.

>> user sharing the pmd, huge_pmd_unshare will clear pud pointing to the
>> pmd. For the task in the middle of the page fault, the ptep returned by
>> huge_pte_alloc points to another task's page table or worse. This leads
>> to bad things such as incorrect page map/reference counts or invalid
>> memory references.
>>
>> i_mmap_rwsem is currently used for pmd sharing synchronization. It is also
>> held during unmap and whenever a call to huge_pmd_unshare is possible. It
>> is only acquired in write mode. Expand and modify the use of i_mmap_rwsem
>> as follows:
>> - i_mmap_rwsem is held in write mode for the duration of truncate
>> processing.
>> - i_mmap_rwsem is held in write mode whenever huge_pmd_share is called.
>
> I guess you mean huge_pmd_unshare here, right?
>

Correct, i_mmap_rwsem is held in write mode whenever huge_pmd_unshare
is called.

>> - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
>> Today that is only via huge_pte_alloc.
>> - i_mmap_rwsem is held in read mode after huge_pte_alloc, until the caller
>> is finished with the returned ptep.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
[...]
>> @@ -505,8 +512,8 @@ static int hugetlb_vmtruncate(struct inode *inode, loff_t offset)
>> i_mmap_lock_write(mapping);
>> if (!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root))
>> hugetlb_vmdelete_list(&mapping->i_mmap, pgoff, 0);
>> - i_mmap_unlock_write(mapping);
>> remove_inode_hugepages(inode, offset, LLONG_MAX);
>> + i_mmap_unlock_write(mapping);
>
> I just have an impression that hugetlbfs_punch_hole() could have the
> similar race and extending lock range there could be an improvement,
> although I might miss something as always.
>

You are correct. The hole punch routine (hugetlbfs_punch_hole) should
continue to hold i_mmap_rwsem in write mode until after calling
remove_inode_hugepages.

>> return 0;
>> }
>>
>> @@ -624,7 +631,11 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>> /* addr is the offset within the file (zero based) */
>> addr = index * hpage_size;
>>
>> - /* mutex taken here, fault path and hole punch */
>> + /*
>> + * fault mutex taken here, protects against fault path
>> + * and hole punch. inode_lock previously taken protects
>> + * against truncation.
>> + */
>> hash = hugetlb_fault_mutex_hash(h, mm, &pseudo_vma, mapping,
>> index, addr);
>> mutex_lock(&hugetlb_fault_mutex_table[hash]);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 7b5c0ad9a6bd..e9da3eee262f 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3252,18 +3252,33 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>>
>> for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
>> spinlock_t *src_ptl, *dst_ptl;
>> + struct vm_area_struct *dst_vma;
>> + struct address_space *mapping;
>> +
>> src_pte = huge_pte_offset(src, addr, sz);
>> if (!src_pte)
>> continue;
>> +
>> + /*
>> + * i_mmap_rwsem must be held to call huge_pte_alloc.
>> + * Continue to hold until finished with dst_pte, otherwise
>> + * it could go away if part of a shared pmd.
>> + */
>> + dst_vma = find_vma(dst, addr);
>> + mapping = dst_vma->vm_file->f_mapping;
>
> If vma->vm_file->f_mapping gives the same mapping, you may omit the find_vma()?
>

Thanks. You are correct. 'dst_vma' should be the same as vma as it is
a copy. This find is unnecessary.

>> + i_mmap_lock_read(mapping);
>> dst_pte = huge_pte_alloc(dst, addr, sz);
>> if (!dst_pte) {
>> + i_mmap_unlock_read(mapping);
>> ret = -ENOMEM;
>> break;
>> }
>>
>> /* If the pagetables are shared don't copy or take references */
>> - if (dst_pte == src_pte)
>> + if (dst_pte == src_pte) {
>> + i_mmap_unlock_read(mapping);
>> continue;
>> + }
>>
>> dst_ptl = huge_pte_lock(h, dst, dst_pte);
>> src_ptl = huge_pte_lockptr(h, src, src_pte);
>
> [...]
>
>> diff --git a/mm/rmap.c b/mm/rmap.c
>> index 1e79fac3186b..db49e734dda8 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -1347,6 +1347,7 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> bool ret = true;
>> unsigned long start = address, end;
>> enum ttu_flags flags = (enum ttu_flags)arg;
>> + bool pmd_sharing_possible = false;
>>
>> /* munlock has nothing to gain from examining un-locked vmas */
>> if ((flags & TTU_MUNLOCK) && !(vma->vm_flags & VM_LOCKED))
>> @@ -1376,8 +1377,15 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
>> * accordingly.
>> */
>> adjust_range_if_pmd_sharing_possible(vma, &start, &end);
>> + if ((end - start) > (PAGE_SIZE << compound_order(page)))
>> + pmd_sharing_possible = true;
>
> Maybe the similar check is done in adjust_range_if_pmd_sharing_possible()
> as the function name claims, so does it make more sense to get this bool
> value via the return value?

Yes, that makes sense. This use of adjust_range_if_pmd_sharing_possible
would be the only place a return value is used.

Thanks for your comments!

I am concerned about the use of any huge pte pointers when the page table
lock or i_mmap_rwsem is not held. There may be more instances of this we
need to protect. For example, huge_pte_offset at the beginning of
hugetlb_fault() is still called without any synchronization. I think we
may need to acquire i_mmap_rwsem before this call. I'm trying to think of
other areas that may be of concern.
--
Mike Kravetz