Re: [RFC v5 PATCH 2/2] mm: mmap: zap pages with read mmap_sem in munmap

From: Laurent Dufour
Date: Tue Jul 24 2018 - 13:32:06 EST




On 24/07/2018 19:26, Yang Shi wrote:
>
>
> On 7/24/18 10:18 AM, Laurent Dufour wrote:
>> On 19/07/2018 01:21, Yang Shi wrote:
>>> When running some mmap/munmap scalability tests with large memory
>>> (i.e. > 300GB), the below hung task issue may happen occasionally.
>>> INFO: task ps:14018 blocked for more than 120 seconds.
>>>       Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
>>>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>>>  ps              D    0 14018      1 0x00000004
>>>   ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
>>>   ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
>>>   00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
>>>  Call Trace:
>>>   [<ffffffff817154d0>] ? __schedule+0x250/0x730
>>>   [<ffffffff817159e6>] schedule+0x36/0x80
>>>   [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
>>>   [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
>>>   [<ffffffff81717db0>] down_read+0x20/0x40
>>>   [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
>>>   [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
>>>   [<ffffffff81241d87>] __vfs_read+0x37/0x150
>>>   [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
>>>   [<ffffffff81242266>] vfs_read+0x96/0x130
>>>   [<ffffffff812437b5>] SyS_read+0x55/0xc0
>>>   [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
>>>
>>> It is because munmap holds mmap_sem exclusively from the very beginning
>>> all the way to the end, and doesn't release it in the middle. When
>>> unmapping a large mapping, this may take a long time (~18 seconds to
>>> unmap a 320GB mapping with every single page mapped on an idle machine).
>>>
>>> Zapping pages is the most time-consuming part. According to the
>>> suggestion from Michal Hocko [1], zapping pages can be done while holding
>>> read mmap_sem, like what MADV_DONTNEED does. Then re-acquire write
>>> mmap_sem to clean up vmas.
>>>
>>> But, some part may need write mmap_sem, for example, vma splitting. So,
>>> the design is as follows:
>>>         acquire write mmap_sem
>>>         lookup vmas (find and split vmas)
>>>         detach vmas
>>>         deal with special mappings
>>>         downgrade_write
>>>
>>>         zap pages
>>>         free page tables
>>>         release mmap_sem
>>>
>>> The vm events which take read mmap_sem may come in during page zapping,
>>> but since the vmas have been detached beforehand, they (i.e. page fault,
>>> gup, etc) will not be able to find a valid vma, and will just return
>>> SIGSEGV or -EFAULT as expected.
>>>
>>> If the vma has VM_LOCKED | VM_HUGETLB | VM_PFNMAP or uprobes, it is
>>> considered a special mapping. Special mappings are dealt with before
>>> zapping pages, with write mmap_sem held: basically, just update vm_flags.
>>>
>>> And, since these are also manipulated by unmap_single_vma(), which is
>>> called by unmap_vmas() with read mmap_sem held in this case, to prevent
>>> vm_flags from being updated in the read critical section, a new parameter
>>> called "skip_flags" is added to unmap_region(), unmap_vmas() and
>>> unmap_single_vma(). If it is true, unmapping those special mappings is
>>> just skipped. Currently, the only place which passes true to this
>>> parameter is us.
>>>
>>> With this approach we don't have to re-acquire mmap_sem again to clean
>>> up vmas, which avoids a race window in which the address space might get
>>> changed.
>>>
>>> And, since the lock acquire/release cost is managed to the minimum and
>>> almost as same as before, the optimization could be extended to any size
>>> of mapping without incuring significan penalty to small mappings.
>>                          ^            ^
>>                     incurring significant
>
> Thanks for catching the typo.
>
>>> For the time being, just do this in the munmap syscall path. Other
>>> vm_munmap() or do_munmap() call sites (i.e. mmap, mremap, etc) remain
>>> intact for stability reasons.
>>>
>>> With the patches, the exclusive mmap_sem hold time when munmapping an
>>> 80GB address space on a machine with 32 cores of E5-2680 @ 2.70GHz
>>> dropped from seconds to the microsecond (us) level.
>>>
>>> munmap_test-15002 [008]   594.380138: funcgraph_entry:              |  vm_munmap_zap_rlock() {
>>> munmap_test-15002 [008]   594.380146: funcgraph_entry: !2485684 us |    unmap_region();
>>> munmap_test-15002 [008]   596.865836: funcgraph_exit:  !2485692 us |  }
>>>
>>> Here the execution time of unmap_region() is used to evaluate the time
>>> spent holding read mmap_sem; the remaining time is spent holding the
>>> exclusive lock.
>>>
>>> [1] https://lwn.net/Articles/753269/
>>>
>>> Suggested-by: Michal Hocko <mhocko@xxxxxxxxxx>
>>> Suggested-by: Kirill A. Shutemov <kirill@xxxxxxxxxxxxx>
>>> Cc: Matthew Wilcox <willy@xxxxxxxxxxxxx>
>>> Cc: Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>
>>> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>>> Signed-off-by: Yang Shi <yang.shi@xxxxxxxxxxxxxxxxx>
>>> ---
>>>  include/linux/mm.h |  2 +-
>>>  mm/memory.c        | 35 +++++++++++++------
>>>  mm/mmap.c          | 99 +++++++++++++++++++++++++++++++++++++++++++++++++-----
>>>  3 files changed, 117 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index a0fbb9f..95a4e97 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -1321,7 +1321,7 @@ void zap_vma_ptes(struct vm_area_struct *vma, unsigned long address,
>>>  void zap_page_range(struct vm_area_struct *vma, unsigned long address,
>>>             unsigned long size);
>>>  void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>> -        unsigned long start, unsigned long end);
>>> +        unsigned long start, unsigned long end, bool skip_flags);
>>>
>>>  /**
>>>   * mm_walk - callbacks for walk_page_range
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 7206a63..00ecdae 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -1514,7 +1514,7 @@ void unmap_page_range(struct mmu_gather *tlb,
>>>  static void unmap_single_vma(struct mmu_gather *tlb,
>>>          struct vm_area_struct *vma, unsigned long start_addr,
>>>          unsigned long end_addr,
>>> -        struct zap_details *details)
>>> +        struct zap_details *details, bool skip_flags)
>>>  {
>>>      unsigned long start = max(vma->vm_start, start_addr);
>>>      unsigned long end;
>>> @@ -1525,11 +1525,13 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>>>      if (end <= vma->vm_start)
>>>          return;
>>>
>>> -    if (vma->vm_file)
>>> -        uprobe_munmap(vma, start, end);
>>> +    if (!skip_flags) {
>>> +        if (vma->vm_file)
>>> +            uprobe_munmap(vma, start, end);
>>>
>>> -    if (unlikely(vma->vm_flags & VM_PFNMAP))
>>> -        untrack_pfn(vma, 0, 0);
>>> +        if (unlikely(vma->vm_flags & VM_PFNMAP))
>>> +            untrack_pfn(vma, 0, 0);
>>> +    }
>> I think a comment would be welcome here to detail why it is safe to not call
>> uprobe_munmap() and untrack_pfn() here, i.e. this has already been done in
>> do_munmap_zap_rlock().
>
> OK
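
Something along these lines maybe (the wording is only a suggestion, feel
free to rephrase):

    if (!skip_flags) {
        /*
         * When skip_flags is true the caller is do_munmap_zap_rlock(),
         * which at this point only holds the read mmap_sem.
         * uprobe_munmap() and untrack_pfn() may need to update vm_flags,
         * so they have already been called there with the write mmap_sem
         * held, before the downgrade, and can safely be skipped here.
         */
        if (vma->vm_file)
            uprobe_munmap(vma, start, end);

        if (unlikely(vma->vm_flags & VM_PFNMAP))
            untrack_pfn(vma, 0, 0);
    }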
>
>>
>>>      if (start != end) {
>>>          if (unlikely(is_vm_hugetlb_page(vma))) {
>>> @@ -1546,7 +1548,19 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>>>               */
>>>              if (vma->vm_file) {
>>>                  i_mmap_lock_write(vma->vm_file->f_mapping);
>>> -                __unmap_hugepage_range_final(tlb, vma, start, end, NULL);
>>> +                if (!skip_flags)
>>> +                    /*
>>> +                     * The vma is being unmapped with read
>>> +                     * mmap_sem.
>>> +                     * Can't update vm_flags, it will be
>>> +                     * updated later with exclusive lock
>>> +                     * held
>>> +                     */
>>> +                    __unmap_hugepage_range(tlb, vma, start,
>>> +                                end, NULL);
>>> +                else
>>> +                    __unmap_hugepage_range_final(tlb, vma,
>>> +                                start, end, NULL);
>>>                  i_mmap_unlock_write(vma->vm_file->f_mapping);
>>>              }
>>>          } else
>>> @@ -1574,13 +1588,14 @@ static void unmap_single_vma(struct mmu_gather *tlb,
>>>   */
>>>  void unmap_vmas(struct mmu_gather *tlb,
>>>          struct vm_area_struct *vma, unsigned long start_addr,
>>> -        unsigned long end_addr)
>>> +        unsigned long end_addr, bool skip_flags)
>>>  {
>>>      struct mm_struct *mm = vma->vm_mm;
>>>
>>>      mmu_notifier_invalidate_range_start(mm, start_addr, end_addr);
>>>      for ( ; vma && vma->vm_start < end_addr; vma = vma->vm_next)
>>> -        unmap_single_vma(tlb, vma, start_addr, end_addr, NULL);
>>> +        unmap_single_vma(tlb, vma, start_addr, end_addr, NULL,
>>> +                skip_flags);
>>>      mmu_notifier_invalidate_range_end(mm, start_addr, end_addr);
>>>  }
>>>
>>> @@ -1604,7 +1619,7 @@ void zap_page_range(struct vm_area_struct *vma, unsigned long start,
>>>      update_hiwater_rss(mm);
>>>      mmu_notifier_invalidate_range_start(mm, start, end);
>>>      for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
>>> -        unmap_single_vma(&tlb, vma, start, end, NULL);
>>> +        unmap_single_vma(&tlb, vma, start, end, NULL, false);
>>>
>>>          /*
>>>           * zap_page_range does not specify whether mmap_sem should be
>>> @@ -1641,7 +1656,7 @@ static void zap_page_range_single(struct vm_area_struct *vma, unsigned long addr
>>>      tlb_gather_mmu(&tlb, mm, address, end);
>>>      update_hiwater_rss(mm);
>>>      mmu_notifier_invalidate_range_start(mm, address, end);
>>> -    unmap_single_vma(&tlb, vma, address, end, details);
>>> +    unmap_single_vma(&tlb, vma, address, end, details, false);
>>>      mmu_notifier_invalidate_range_end(mm, address, end);
>>>      tlb_finish_mmu(&tlb, address, end);
>>>  }
>>> diff --git a/mm/mmap.c b/mm/mmap.c
>>> index 2504094..f5d5312 100644
>>> --- a/mm/mmap.c
>>> +++ b/mm/mmap.c
>>> @@ -73,7 +73,7 @@
>>>
>>>  static void unmap_region(struct mm_struct *mm,
>>>          struct vm_area_struct *vma, struct vm_area_struct *prev,
>>> -        unsigned long start, unsigned long end);
>>> +        unsigned long start, unsigned long end, bool skip_flags);
>>>
>>>  /* description of effects of mapping type and prot in current implementation.
>>>   * this is due to the limited x86 page protection hardware. The expected
>>> @@ -1824,7 +1824,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
>>>      fput(file);
>>>
>>>      /* Undo any partial mapping done by a device driver. */
>>> -    unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end);
>>> +    unmap_region(mm, vma, prev, vma->vm_start, vma->vm_end, false);
>>>      charged = 0;
>>>      if (vm_flags & VM_SHARED)
>>>          mapping_unmap_writable(file->f_mapping);
>>> @@ -2559,7 +2559,7 @@ static void remove_vma_list(struct mm_struct *mm, struct vm_area_struct *vma)
>>>   */
>>>  static void unmap_region(struct mm_struct *mm,
>>>          struct vm_area_struct *vma, struct vm_area_struct *prev,
>>> -        unsigned long start, unsigned long end)
>>> +        unsigned long start, unsigned long end, bool skip_flags)
>>>  {
>>>      struct vm_area_struct *next = prev ? prev->vm_next : mm->mmap;
>>>      struct mmu_gather tlb;
>>> @@ -2567,7 +2567,7 @@ static void unmap_region(struct mm_struct *mm,
>>>      lru_add_drain();
>>>      tlb_gather_mmu(&tlb, mm, start, end);
>>>      update_hiwater_rss(mm);
>>> -    unmap_vmas(&tlb, vma, start, end);
>>> +    unmap_vmas(&tlb, vma, start, end, skip_flags);
>>>      free_pgtables(&tlb, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
>>>                    next ? next->vm_start : USER_PGTABLES_CEILING);
>>>      tlb_finish_mmu(&tlb, start, end);
>>> @@ -2778,6 +2778,79 @@ static inline void munmap_mlock_vma(struct vm_area_struct *vma,
>>>      }
>>>  }
>>>
>>> +/*
>>> + * Zap pages with read mmap_sem held
>>> + *
>>> + * uf is the list for userfaultfd
>>> + */
>>> +static int do_munmap_zap_rlock(struct mm_struct *mm, unsigned long start,
>>> +                   size_t len, struct list_head *uf)
>>> +{
>>> +    unsigned long end = 0;
>>> +    struct vm_area_struct *start_vma = NULL, *prev, *vma;
>>> +    int ret = 0;
>>> +
>>> +    if (!munmap_addr_sanity(start, len))
>>> +        return -EINVAL;
>>> +
>>> +    len = PAGE_ALIGN(len);
>>> +
>>> +    end = start + len;
>>> +
>>> +    /*
>>> +     * need write mmap_sem to split vmas and detach vmas
>>> +     * splitting vma up-front to save PITA to clean if it is failed
>>> +     */
>>> +    if (down_write_killable(&mm->mmap_sem))
>>> +        return -EINTR;
>>> +
>>> +    ret = munmap_lookup_vma(mm, &start_vma, &prev, start, end);
>>> +    if (ret != 1)
>>> +        goto out;
>>> +
>>> +    if (unlikely(uf)) {
>>> +        ret = userfaultfd_unmap_prep(start_vma, start, end, uf);
>>> +        if (ret)
>>> +            goto out;
>>> +    }
>>> +
>>> +    /* Handle mlocked vmas */
>>> +    if (mm->locked_vm)
>>> +        munmap_mlock_vma(start_vma, end);
>>> +
>>> +    /* Detach vmas from rbtree */
>>> +    detach_vmas_to_be_unmapped(mm, start_vma, prev, end);
>>> +
>>> +    /*
>>> +     * Clear uprobe, VM_PFNMAP and hugetlb mapping in advance since they
>>> +     * need update vm_flags with write mmap_sem
>>> +     */
>>> +    vma = start_vma;
>>> +    for ( ; vma && vma->vm_start < end; vma = vma->vm_next) {
>>> +        if (vma->vm_file)
>>> +            uprobe_munmap(vma, vma->vm_start, vma->vm_end);
>>> +        if (unlikely(vma->vm_flags & VM_PFNMAP))
>>> +            untrack_pfn(vma, 0, 0);
>>> +        if (is_vm_hugetlb_page(vma))
>>> +            vma->vm_flags &= ~VM_MAYSHARE;
>>> +    }
>>> +
>>> +    downgrade_write(&mm->mmap_sem);
>>> +
>>> +    /* zap mappings with read mmap_sem */
>>> +    unmap_region(mm, start_vma, prev, start, end, true);
>>> +
>>> +    arch_unmap(mm, start_vma, start, end);
>>> +    remove_vma_list(mm, start_vma);
>>> +    up_read(&mm->mmap_sem);
>>> +
>>> +    return 0;
>>> +
>>> +out:
>>> +    up_write(&mm->mmap_sem);
>>> +    return ret;
>>> +}
>>> +
>>>  /* Munmap is split into 2 main parts -- this part which finds
>>>   * what needs doing, and the areas themselves, which do the
>>>   * work. This now handles partial unmappings.
>>> @@ -2826,7 +2899,7 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>>       * Remove the vma's, and unmap the actual pages
>>>       */
>>>      detach_vmas_to_be_unmapped(mm, vma, prev, end);
>>> -    unmap_region(mm, vma, prev, start, end);
>>> +    unmap_region(mm, vma, prev, start, end, false);
>>>
>>>      arch_unmap(mm, vma, start, end);
>>>
>>> @@ -2836,6 +2909,17 @@ int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
>>>      return 0;
>>>  }
>>>
>>> +static int vm_munmap_zap_rlock(unsigned long start, size_t len)
>>> +{
>>> +    int ret;
>>> +    struct mm_struct *mm = current->mm;
>>> +    LIST_HEAD(uf);
>>> +
>>> +    ret = do_munmap_zap_rlock(mm, start, len, &uf);
>>> +    userfaultfd_unmap_complete(mm, &uf);
>>> +    return ret;
>>> +}
>>> +
>>>  int vm_munmap(unsigned long start, size_t len)
>>>  {
>>>      int ret;
>> A stupid question: since the overhead of vm_munmap_zap_rlock() compared to
>> vm_munmap() is not significant, why not put that in vm_munmap() instead of
>> introducing a new vm_munmap_zap_rlock()?
>
> Since vm_munmap() is called in other paths too, i.e. drm driver, kvm, etc., I'm
> not quite sure if those paths are safe enough for this optimization. And, it
> looks like they are not the main sources of the latency, so here I introduced
> vm_munmap_zap_rlock() for munmap() only.

For my information, what could be unsafe for these paths?
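
(What I have in mind is simply reusing the body of vm_munmap_zap_rlock()
directly in vm_munmap(), i.e. roughly the sketch below, assuming the other
callers can live with the downgrade:

int vm_munmap(unsigned long start, size_t len)
{
        int ret;
        struct mm_struct *mm = current->mm;
        LIST_HEAD(uf);

        /* do_munmap_zap_rlock() takes and downgrades mmap_sem itself */
        ret = do_munmap_zap_rlock(mm, start, len, &uf);
        userfaultfd_unmap_complete(mm, &uf);
        return ret;
}
)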

>
> If someone reports or we see they are the sources of latency too, and the
> optimization is proved safe for them, we can definitely extend this to all
> vm_munmap() calls.
>
> Thanks,
> Yang
>
>>
>>> @@ -2855,10 +2939,9 @@ int vm_munmap(unsigned long start, size_t len)
>>>  SYSCALL_DEFINE2(munmap, unsigned long, addr, size_t, len)
>>>  {
>>>      profile_munmap(addr);
>>> -    return vm_munmap(addr, len);
>>> +    return vm_munmap_zap_rlock(addr, len);
>>>  }
>>>
>>> -
>>>  /*
>>>   * Emulation of deprecated remap_file_pages() syscall.
>>>   */
>>> @@ -3146,7 +3229,7 @@ void exit_mmap(struct mm_struct *mm)
>>>      tlb_gather_mmu(&tlb, mm, 0, -1);
>>>      /* update_hiwater_rss(mm) here? but nobody should be looking */
>>>      /* Use -1 here to ensure all VMAs in the mm are unmapped */
>>> -    unmap_vmas(&tlb, vma, 0, -1);
>>> +    unmap_vmas(&tlb, vma, 0, -1, false);
>>>      free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
>>>      tlb_finish_mmu(&tlb, 0, -1);
>>>
>