Re: [PATCH] hugetlbfs: Take read_lock on i_mmap for PMD sharing

From: Waiman Long
Date: Tue Nov 12 2019 - 21:55:11 EST


On 11/12/19 6:11 PM, Mike Kravetz wrote:
> On 11/12/19 9:27 AM, Waiman Long wrote:
>> On 11/8/19 8:47 PM, Mike Kravetz wrote:
>>> On 11/8/19 11:10 AM, Mike Kravetz wrote:
>>>> On 11/7/19 6:04 PM, Davidlohr Bueso wrote:
>>>>> On Thu, 07 Nov 2019, Mike Kravetz wrote:
>>>>>
>>>>>> Note that huge_pmd_share now increments the page count with the semaphore
>>>>>> held just in read mode. It is OK to do increments in parallel without
>>>>>> synchronization. However, we don't want anyone else changing the count
>>>>>> while that check in huge_pmd_unshare is happening. Hence, the need for
>>>>>> taking the semaphore in write mode.
>>>>> This would be a nice addition to the changelog methinks.
>>>> Last night I remembered there is one place where we currently take
>>>> i_mmap_rwsem in read mode and potentially call huge_pmd_unshare. That
>>>> is in try_to_unmap_one. Yes, there is a potential race here today.
>>> Actually there is no race there today. Callers to huge_pmd_unshare
>>> hold the page table lock. So, this synchronizes those unshare calls
>>> from page migration and page poisoning.
>>>
>>>> But that race is somewhat contained as you need two threads doing some
>>>> combination of page migration and page poisoning to race. This change
>>>> now allows migration or poisoning to race with page fault. I would
>>>> really prefer if we do not open up the race window in this manner.
>>> But, we do open a race window by changing huge_pmd_share to take the
>>> i_mmap_rwsem in read mode as in the original patch.
>>>
>>> Here is the additional code needed to take the semaphore in write mode
>>> for the huge_pmd_unshare calls via try_to_unmap_one. We would need to
>>> combine this with Longman's patch. Please take a look and provide feedback.
>>> Some of the changes are subtle, especially the exception for MAP_PRIVATE
>>> mappings, but I tried to add sufficient comments.
>>>
>>> From 21735818a520705c8573b8d543b8f91aa187bd5d Mon Sep 17 00:00:00 2001
>>> From: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>> Date: Fri, 8 Nov 2019 17:25:37 -0800
>>> Subject: [PATCH] Changes needed for taking i_mmap_rwsem in write mode before
>>> call to huge_pmd_unshare in try_to_unmap_one.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@xxxxxxxxxx>
>>> ---
>>> mm/hugetlb.c | 9 ++++++++-
>>> mm/memory-failure.c | 28 +++++++++++++++++++++++++++-
>>> mm/migrate.c | 27 +++++++++++++++++++++++++--
>>> 3 files changed, 60 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index f78891f92765..73d9136549a5 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4883,7 +4883,14 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>>> * indicated by page_count > 1, unmap is achieved by clearing pud and
>>> * decrementing the ref count. If count == 1, the pte page is not shared.
>>> *
>>> - * called with page table lock held.
>>> + * Must be called while holding page table lock.
>>> + * In general, the caller should also hold the i_mmap_rwsem in write mode.
>>> + * This is to prevent races with page faults calling huge_pmd_share which
>>> + * will not be holding the page table lock, but will be holding i_mmap_rwsem
>>> + * in read mode. It is possible to call without holding i_mmap_rwsem in
>>> + * write mode if the caller KNOWS the page table is associated with a private
>>> + * mapping. This is because private mappings can not share PMDs and can
>>> + * not race with huge_pmd_share calls during page faults.
>> So the page table lock here is the huge_pte_lock(). Right? In
>> huge_pmd_share(), the pte lock has to be taken before one can share it.
>> So would you mind explaining where exactly is the race?
> My concern was that this code in huge_pmd_share:
>
> saddr = page_table_shareable(svma, vma, addr, idx);
> if (saddr) {
> spte = huge_pte_offset(svma->vm_mm, saddr,
> vma_mmu_pagesize(svma));
> if (spte) {
> get_page(virt_to_page(spte));
> break;
> }
> }
>
> and this code in huge_pmd_unshare:
>
> BUG_ON(page_count(virt_to_page(ptep)) == 0);
> if (page_count(virt_to_page(ptep)) == 1)
> return 0;
>
> could run concurrently. Note that return code for huge_pmd_unshare is
> specified as,
>
> * returns: 1 successfully unmapped a shared pte page
> * 0 the underlying pte page is not shared, or it is the last user
>
> And, callers take different action depending on the return value.
>
> Now, since the code blocks above can run concurrently it is possible that:
> - huge_pmd_unshare sees page_count == 1
> - huge_pmd_share increments page_count in anticipation of sharing
> - huge_pmd_unshare returns 0 indicating there is no pmd sharing
> - huge_pmd_share waits for page table lock to insert pmd page before it
> sharts sharing
>
> My concern was with what might happen if a huge_pmd_unshare caller received
> a 0 return code and thought there were no other users of the pmd. In the
> specific case mentioned here (try_to_unmap_one) I now do not believe there
> are any issues. The caller is simply going to clear one entry on the pmd
> page. I can not think of a way this could impact the other thread waiting
> to share the pmd. It will simply start sharing the pmd after it gets the
> page table lock and the entry has been removed.
>
> As the result of your question, I went back and took a really close look
> at the code in question. While that huge_pmd_share/huge_pmd_unshare code
> running concurrently does make me a bit nervous, I can not find any specific
> problem. In addition, I ran a test causes this race for several hours
> without issue.
>
> Sorry for the churn/second thoughts. This code is subtle, and sometimes hard
> to understand. IMO, Longman's patch (V2) can go forward, but please delete
> the following blob in the commit message from me:
>
> "Note that huge_pmd_share now increments the page count with the
> semaphore held just in read mode. It is OK to do increments in
> parallel without synchronization. However, we don't want anyone else
> changing the count while that check in huge_pmd_unshare is happening.
> Hence, the need for taking the semaphore in write mode."
>
Thanks for the explanation. I understand that it is often time hard to
figure out if a race exists or not. I got confused myself many times.
Anyway, as long as any destructive action is only taken while holding
the page table lock, it should be safe.

Cheers,
Longman