Re: [LKP] Re: [hugetlbfs] c0d0381ade: vm-scalability.throughput -33.4% regression

From: Mike Kravetz
Date: Fri Aug 21 2020 - 17:02:59 EST


On 8/21/20 1:39 AM, Xing Zhengjun wrote:
>
>
> On 6/26/2020 5:33 AM, Mike Kravetz wrote:
>> On 6/22/20 3:01 PM, Mike Kravetz wrote:
>>> On 6/21/20 5:55 PM, kernel test robot wrote:
>>>> Greeting,
>>>>
>>>> FYI, we noticed a -33.4% regression of vm-scalability.throughput due to commit:
>>>>
>>>>
>>>> commit: c0d0381ade79885c04a04c303284b040616b116e ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>
>>>> in testcase: vm-scalability
>>>> on test machine: 288 threads Intel(R) Xeon Phi(TM) CPU 7295 @ 1.50GHz with 80G memory
>>>> with following parameters:
>>>>
>>>> runtime: 300s
>>>> size: 8T
>>>> test: anon-cow-seq-hugetlb
>>>> cpufreq_governor: performance
>>>> ucode: 0x11
>>>>
>>>
>>> Some performance regression is not surprising as the change includes acquiring
>>> and holding the i_mmap_rwsem (in read mode) during hugetlb page faults. 33.4%
>>> seems a bit high. But, the test is primarily exercising the hugetlb page
>>> fault path and little else.
>>>
>>> The reason for taking the i_mmap_rwsem is to prevent PMD unsharing from
>>> invalidating the pmd we are operating on. This specific test case is operating
>>> on anonymous private mappings. So, PMD sharing is not possible and we can
>>> eliminate acquiring the mutex in this case. In fact, we should check all
>>> mappings (even sharable) for the possibly of PMD sharing and only take the
>>> mutex if necessary. It will make the code a bit uglier, but will take care
>>> of some of these regressions. We still need to take the mutex in the case
>>> of PMD sharing. I'm afraid a regression is unavoidable in that case.
>>>
>>> I'll put together a patch.
>>
>> Not acquiring the mutex on faults when sharing is not possible is quite
>> straight forward. We can even use the existing routine vma_shareable()
>> to easily check. However, the next patch in the series 87bf91d39bb5
>> "hugetlbfs: Use i_mmap_rwsem to address page fault/truncate race" depends
>> on always acquiring the mutex. If we break this assumption, then the
>> code to back out hugetlb reservations needs to be written. A high level
>> view of what needs to be done is in the commit message for 87bf91d39bb5.
>>
>> I'm working on the code to back out reservations.
>>
>
> I find that 34ae204f18519f0920bd50a644abd6fefc8dbfcf(hugetlbfs: remove call to huge_pte_alloc without i_mmap_rwsem) fixed this regression, I test with the patch, the regression reduced to 10.1%, do you have plan to continue to improve it? Thanks.

Thank you for testing!

Commit 34ae204f1851 was not created to address performance. Rather it was
created more for the sake of correctness.

IIRC, this test is going to stress the page fault path by continuing to produce
faults for the duration of the test. Commit 34ae204f1851 removed an unneeded,
somewhat dangerous and redundant call to huge_pte_alloc in the fault path.
Your testing showed that removing this call helped address a good amount of the
performance regression. The series proposed in this thread attempts to
eliminate more potentially unnecessary code in the hugetlb fault path.
Specifically, it will only acquire the i_mmap_rwsem when necessary.

Would you be willing to test this series on top of 34ae204f1851? I will need
to rebase the series to take the changes made by 34ae204f1851 into account.

BTW - I believe that shared hugetlb mappings are the most common use case.
In the shared case, acquiring i_mmap_rwsem is necessary if pmd sharing is
possible. There is little we can do to eliminate this. The test suite does
not appear to test shared mappings.
--
Mike Kravetz