Re: [PATCH v2] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing

From: Mike Kravetz
Date: Mon Oct 24 2022 - 19:39:57 EST


On 10/22/22 19:50, Mike Kravetz wrote:
> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the
> page tables associated with the address range. For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final. However,
> __unmap_hugepage_range_final assumes the passed vma is about to be
> removed and deletes the vma_lock to prevent pmd sharing as the vma is
> on the way out. In the case of madvise(MADV_DONTNEED) the vma remains,
> but the missing vma_lock prevents pmd sharing and could potentially
> lead to issues with truncation/fault races.
>
> This issue was originally reported here [1] as a BUG triggered in
> page_try_dup_anon_rmap. Prior to the introduction of the hugetlb
> vma_lock, __unmap_hugepage_range_final cleared the VM_MAYSHARE flag to
> prevent pmd sharing. Subsequent faults on this vma were confused as
> VM_MAYSHARE indicates a sharable vma, but was not set so page_mapping
> was not set in new pages added to the page table. This resulted in
> pages that appeared anonymous in a VM_SHARED vma and triggered the BUG.
>
> Create a new routine clear_hugetlb_page_range() that can be called from
> madvise(MADV_DONTNEED) for hugetlb vmas. It has the same setup as
> zap_page_range, but does not delete the vma_lock.

After seeing a syzbot use after free report [2] that is also addressed by
this patch, I started thinking ...

When __unmap_hugepage_range_final was created, the only time unmap_single_vma
was called for hugetlb vmas was during process exit time via exit_mmap. I got
in trouble when I added a call via madvise(MADV_DONTNEED) which calls
zap_page_range. This patch takes care of that calling path by having
madvise(MADV_DONTNEED) call a new routine clear_hugetlb_page_range instead of
zap_page_range for hugetlb vmas. The use after free bug had me auditing code
paths to make sure __unmap_hugepage_range_final was REALLY only called at
process exit time. If not, and we could fault on a vma after calling
__unmap_hugepage_range_final we would be in trouble.

My thought was, what if we had __unmap_hugepage_range_final check mm->mm_users
to determine if it was being called in the process exit path? If !mm_users,
then we can delete the vma_lock to prevent pmd sharing as we know the process
is exiting. If not, we do not delete the lock. That seems to be more robust
and would prevent issues if someone accidentally introduces a new code path
where __unmap_hugepage_range_final (unmap_single_vma for a hugetlb vma)
could be called outside process exit context.

Thoughts?

[2] https://lore.kernel.org/linux-mm/000000000000d5e00a05e834962e@xxxxxxxxxx/
--
Mike Kravetz