Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing

From: Mike Kravetz
Date: Tue Nov 12 2019 - 18:11:30 EST


On 11/12/19 9:27 AM, Waiman Long wrote:
> On 11/8/19 8:47 PM, Mike Kravetz wrote:
>> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>>
>>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>>> held just in read mode. It is OK to do increments in parallel without
>>>>> synchronization. However, we don't want anyone else changing the count
>>>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>>>> taking the semaphore in write mode.
>>>> This would be a nice addition to the changelog methinks.
>>> Last night I remembered there is one place where we currently take
>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>>> is in try_to_unmap_one. Yes, there is a potential race here today.
>> Actually there is no race there today. Callers to huge_pmd_unshare
>> hold the page table lock. So, this synchronizes those unshare calls
>> from page migration and page poisoning.
>>
>>> But that race is somewhat contained as you need two threads doing some
>>> combination of page migration and page poisoning to race. This change
>>> now allows migration or poisoning to race with page fault. I would
>>> really prefer if we do not open up the race window in this manner.
>> But, we do open a race window by changing huge_pmd_share to take the
>> i_mmap_rwsem in read mode as in the original patch.
>>
>> Here is the additional code needed to take the semaphore in write mode
>> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
>> combine this with Longman's patch. Please take a look and provide feedback.
>> Some of the changes are subtle, especially the exception for MAP_PRIVATE
>> mappings, but I tried to add sufficient comments.
>>
>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
>> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> Date: Fri, 8 Nov 2019 17:25:37 -0800
>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>> call to huge_pmd_unshare in try_to_unmap_one.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>> ---
>> mm/hugetlb.c | 9 ++++++++-
>> mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>> mm/migrate.c | 27 +++++++++++++++++++++++++--
>> 3 files changed, 60 insertions(+), 4 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index f78891f92765..73d9136549a5 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>> * indicated by page_count > 1, unmap is achieved by clearing pud and
>> * decrementing the ref count. If count == 1, the pte page is not shared.
>> *
>> - * called with page table lock held.
>> + * Must be called while holding page table lock.
>> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
>> + * This is to prevent races with page faults calling huge_pmd_share which
>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
>> + * in read mode. It is possible to call without holding i_mmap_rwsem in
>> + * write mode if the caller KNOWS the page table is associated with a private
>> + * mapping. This is because private mappings can not share PMDs and can
>> + * not race with huge_pmd_share calls during page faults.
>
> So the page table lock here is the huge_pte_lock(). Right? In
> huge_pmd_share(), the pte lock has to be taken before one can share it.
> So would you mind explaining where exactly is the race?

My concern was that this code in huge_pmd_share:

saddr = page_table_shareable(svma, vma, addr, idx);
if (saddr) {
spte = huge_pte_offset(svma->vm_mm, saddr,
vma_mmu_pagesize(svma));
if (spte) {
get_page(virt_to_page(spte));
break;
}
}

and this code in huge_pmd_unshare:

BUG_ON(page_count(virt_to_page(ptep)) == 0);
if (page_count(virt_to_page(ptep)) == 1)
return 0;

could run concurrently. Note that return code for huge_pmd_unshare is
specified as,

* returns: 1 successfully unmapped a shared pte page
* 0 the underlying pte page is not shared, or it is the last user

And, callers take different action depending on the return value.

Now, since the code blocks above can run concurrently it is possible that:
- huge_pmd_unshare sees page_count == 1
- huge_pmd_share increments page_count in anticipation of sharing
- huge_pmd_unshare returns 0 indicating there is no pmd sharing
- huge_pmd_share waits for page table lock to insert pmd page before it
sharts sharing

My concern was with what might happen if a huge_pmd_unshare caller received
a 0 return code and thought there were no other users of the pmd. In the
specific case mentioned here (try_to_unmap_one) I now do not believe there
are any issues. The caller is simply going to clear one entry on the pmd
page. I can not think of a way this could impact the other thread waiting
to share the pmd. It will simply start sharing the pmd after it gets the
page table lock and the entry has been removed.

As the result of your question, I went back and took a really close look
at the code in question. While that huge_pmd_share/huge_pmd_unshare code
running concurrently does make me a bit nervous, I can not find any specific
problem. In addition, I ran a test causes this race for several hours
without issue.

Sorry for the churn/second thoughts. This code is subtle, and sometimes hard
to understand. IMO, Longman's patch (V2) can go forward, but please delete
the following blob in the commit message from me:

"Note that huge_pmd_share now increments the page count with the
semaphore held just in read mode. It is OK to do increments in
parallel without synchronization. However, we don't want anyone else
changing the count while that check in huge_pmd_unshare is happening.
Hence, the need for taking the semaphore in write mode."

--
Mike Kravetz