Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
From: Liam R. Howlett
Date: Mon Oct 07 2024 - 21:51:28 EST
* Jann Horn <jannh@xxxxxxxxxx> [241007 17:31]:
> On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> > * Jann Horn <jannh@xxxxxxxxxx> [241007 15:06]:
> > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett <Liam.Howlett@xxxxxxxxxx> wrote:
> > > > Instead of zeroing the vma tree and then overwriting the area, let the
> > > > area be overwritten and then clean up the gathered vmas using
> > > > vms_complete_munmap_vmas().
> > > >
> > > > To ensure locking is downgraded correctly, the mm is set regardless
> > > > of whether the request is MAP_FIXED or not (NULL vma).
> > > >
> > > > If a driver is mapping over an existing vma, then clear the ptes
> > > > before the call_mmap() invocation. This is done using the
> > > > vms_clean_up_area() helper. If there is a vm_ops->close(), it must
> > > > also be called to ensure any cleanup is done before mapping over
> > > > the area. This also means that a call to vm_ops->open() has been
> > > > added to the abort path of an unmap operation, for now.
> > >
> > > As currently implemented, this is not a valid optimization because it
> > > violates the (unwritten?) rule that you must not call free_pgd_range()
> > > on a region in the page tables which can concurrently be walked. A
> > > region in the page tables can be concurrently walked if it overlaps a
> > > VMA which is linked into rmaps which are not write-locked.
> >
> > Just for clarity, this is the rmap write lock.
>
> Ah, yes.
>
> > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA, and
> > > the new mapping is created by expanding an adjacent VMA, the following
> > > race with an ftruncate() is possible (because page tables for the old
> > > mapping are removed while the new VMA in the same location is already
> > > fully set up and linked into the rmap):
> > >
> > >
> > > task 1 (mmap, MAP_FIXED)        task 2 (ftruncate)
> > > ========================        ==================
> > > mmap_region
> > >   vma_merge_new_range
> > >     vma_expand
> > >       commit_merge
> > >         vma_prepare
> > >           [take rmap locks]
> > >         vma_set_range
> > >           [expand adjacent mapping]
> > >         vma_complete
> > >           [drop rmap locks]
> > >   vms_complete_munmap_vmas
> > >     vms_clear_ptes
> > >       unmap_vmas
> > >         [removes ptes]
> > >     free_pgtables
> > >       [unlinks old vma from rmap]
> > >                                   unmap_mapping_range
> > >                                     unmap_mapping_pages
> > >                                       i_mmap_lock_read
> > >                                       unmap_mapping_range_tree
> > >                                         [loop]
> > >                                           unmap_mapping_range_vma
> > >                                             zap_page_range_single
> > >                                               unmap_single_vma
> > >                                                 unmap_page_range
> > >                                                   zap_p4d_range
> > >                                                     zap_pud_range
> > >                                                       zap_pmd_range
> > >                                                         [looks up pmd entry]
> > >       free_pgd_range
> > >         [frees pmd]
> > >                                                         [UAF pmd entry access]
> > >
> > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > to widen the race windows, then build and run the attached reproducer
> > > mmap-fixed-race.c.
> > >
> > > Under a kernel with KASAN, you should ideally get a KASAN splat like this:
> >
> > Thanks for all the work you did finding the root cause here, I
> > appreciate it.
>
> Ah, this is not a bug I ran into while testing, it's a bug I found
> while reading the patch. It's much easier to explain the issue and
> come up with a nice reproducer this way than when you start out from a
> crash. :P
>
> > I think the correct fix is to take the rmap lock on free_pgtables, when
> > necessary. There are a few code paths (error recovery) that are not
> > regularly run that will also need to change.
>
> Hmm, yes, I guess that might work. Though I think there might be more
> races: One related aspect of this optimization that is unintuitive to
> me is that, directly after vma_merge_new_range(), a concurrent rmap
> walk could probably be walking the newly-extended VMA but still
> observe PTEs belonging to the previous VMA. I don't know how robust
> the various rmap walks are to things like encountering pfnmap PTEs in
> non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> example, page_vma_mapped_walk() looks like, if you called it on a page
> table range with huge PUD entries, but with a VMA without VM_HUGETLB,
> something might go wrong on the "pmd_offset(pud, pvmw->address)" call,
> and a 1G hugepage might get misinterpreted as a page table? But I
> haven't experimentally verified that.
Yes, I am also concerned that reacquiring the lock will result in
another race. I also don't think holding the lock for longer is a good
idea, as it would most likely cause a regression by extending the lock
across the whole mmap() setup. Although maybe it would be fine if we
only kept it held when we are going to remove a vma in the MAP_FIXED
case.
Another idea would be to have the pte walkers detect, via the per-vma
locking, that a vma is being modified, but I'm not sure what action a
walker should take once it detects that to avoid the issue. This would
also need to be done in all the walkers (or low enough in the stack).
By the way, this isn't an optimisation; it fixes RCU walkers of the
vma tree seeing a hole between the munmap() and mmap() halves of the
underlying MAP_FIXED operation. This is needed for things like the
/proc/{pid}/maps rcu walker. The page table code currently falls back
to the old way of locking if a hole is seen (and sane applications
shouldn't really be page faulting in something that is being removed
anyway...)
Thanks,
Liam