Re: [PATCH v2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

From: Mike Kravetz
Date: Mon Oct 24 2022 - 20:44:25 EST


On 10/24/22 14:55, Mike Kravetz wrote:
> On 10/22/22 19:50, Mike Kravetz wrote:
> > madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
> > page tables associated with the address range. For hugetlb vmas,
> > zap_page_range will call __unmap_hugepage_range_final. However,
> > __unmap_hugepage_range_final assumes the passed vma is about to be
> > removed and deletes the vma_lock to prevent pmd sharing as the vma is
> > on the way out. In the case of madvise(MADV_DONTNEED) the vma remains,
> > but the missing vma_lock prevents pmd sharing and could potentially
> > lead to issues with truncation/fault races.
> >
> > This issue was originally reported here [1] as a BUG triggered in
> > page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
> > vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> > prevent pmd sharing. Subsequent faults on this vma were confused as
> > VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
> > was not set in new pages added to the page table. This resulted in
> > pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
> >
> > Create a new routine clear_hugetlb_page_range() that can be called from
> > madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as
> > zap_page_range, but does not delete the vma_lock.
>
> After seeing a syzbot use after free report [2] that is also addressed by
> this patch, I started thinking ...
>
> When __unmap_hugepage_range_final was created, the only time unmap_single_vma
> was called for hugetlb vmas was during process exit time via exit_mmap. I got
> in trouble when I added a call via madvise(MADV_DONTNEED) which calls
> zap_page_range. This patch takes care of that calling path by having
> madvise(MADV_DONTNEED) call a new routine clear_hugetlb_page_range instead of
> zap_page_range for hugetlb vmas. The use after free bug had me auditing code
> paths to make sure __unmap_hugepage_range_final was REALLY only called at
> process exit time. If not, and we could fault on a vma after calling
> __unmap_hugepage_range_final we would be in trouble.
>
> My thought was, what if we had __unmap_hugepage_range_final check mm->mm_users
> to determine if it was being called in the process exit path? If !mm_users,
> then we can delete the vma_lock to prevent pmd sharing as we know the process
> is exiting. If not, we do not delete the lock. That seems to be more robust
> and would prevent issues if someone accidentally introduces a new code path
> where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma)
> could be called outside process exit context.
>
> Thoughts?
>
> [2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@xxxxxxxxxx/

Sorry if this seems like I am talking to myself. Here is a proposed v3 as
described above.