Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()

From: Liam R. Howlett
Date: Mon Jul 08 2024 - 16:43:49 EST


* Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> [240708 08:53]:
> On Thu, Jul 04, 2024 at 02:27:18PM GMT, Liam R. Howlett wrote:
> > From: "Liam R. Howlett" <Liam.Howlett@xxxxxxxxxx>
> >
> > The MAP_FIXED page count is available after the vms_gather_munmap_vmas()
> > call, so use it instead of looping over the vmas twice.
>
> Predictably indeed you removed the thing I commented on in the last patch
> ;) but at least this time I predicted it! ;)
>
> >
> > Signed-off-by: Liam R. Howlett <Liam.Howlett@xxxxxxxxxx>
> > ---
> > mm/mmap.c | 36 ++++--------------------------------
> > 1 file changed, 4 insertions(+), 32 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index b2de26683903..62edaabf3987 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -400,27 +400,6 @@ anon_vma_interval_tree_post_update_vma(struct vm_area_struct *vma)
> > anon_vma_interval_tree_insert(avc, &avc->anon_vma->rb_root);
> > }
> >
> > -static unsigned long count_vma_pages_range(struct mm_struct *mm,
> > - unsigned long addr, unsigned long end,
> > - unsigned long *nr_accounted)
> > -{
> > - VMA_ITERATOR(vmi, mm, addr);
> > - struct vm_area_struct *vma;
> > - unsigned long nr_pages = 0;
> > -
> > - *nr_accounted = 0;
> > - for_each_vma_range(vmi, vma, end) {
> > - unsigned long vm_start = max(addr, vma->vm_start);
> > - unsigned long vm_end = min(end, vma->vm_end);
> > -
> > - nr_pages += PHYS_PFN(vm_end - vm_start);
> > - if (vma->vm_flags & VM_ACCOUNT)
> > - *nr_accounted += PHYS_PFN(vm_end - vm_start);
> > - }
> > -
> > - return nr_pages;
> > -}
> > -
> > static void __vma_link_file(struct vm_area_struct *vma,
> > struct address_space *mapping)
> > {
> > @@ -2946,17 +2925,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > pgoff_t vm_pgoff;
> > int error = -ENOMEM;
> > VMA_ITERATOR(vmi, mm, addr);
> > - unsigned long nr_pages, nr_accounted;
> > -
> > - nr_pages = count_vma_pages_range(mm, addr, end, &nr_accounted);
> > -
> > - /* Check against address space limit. */
> > - /*
> > - * MAP_FIXED may remove pages of mappings that intersects with requested
> > - * mapping. Account for the pages it would unmap.
> > - */
> > - if (!may_expand_vm(mm, vm_flags, pglen - nr_pages))
> > - return -ENOMEM;
> >
> > if (unlikely(!can_modify_mm(mm, addr, end)))
> > return -EPERM;
> > @@ -2987,6 +2955,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
> > vma_iter_next_range(&vmi);
> > }
> >
> > + /* Check against address space limit. */
> > + if (!may_expand_vm(mm, vm_flags, pglen - vms.nr_pages))
> > + goto abort_munmap;
> > +
>
> I know you can literally only do this after the vms_gather_munmap_vmas(),
> but this does change where we check this, so for instance we do
> arch_unmap() without having checked may_expand_vm().
>
> However I assume this is fine?

Thanks for pointing this out.

The functionality here has changed
--- from ---
may_expand_vm() check
can_modify_mm() check
arch_unmap()
vms_gather_munmap_vmas()
...

--- to ---
can_modify_mm() check
arch_unmap()
vms_gather_munmap_vmas()
may_expand_vm() check
...

vms_gather_munmap_vmas() does nothing but figures out what to do later,
but could use memory and can fail.

The user implications are:

1. The return type on the error may change to -EPERM from -ENOMEM, if
you are not allowed to expand and are trying to overwrite mseal()'ed
VMAs. That seems so very rare that I'm not sure it's worth mentioning.


2. arch_unmap() called prior to may_expand_vm().
powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
within the unmap range. User implication of this means that an
application my set the vdso to NULL prior to hitting the -ENOMEM case in
may_expand_vm() due to the address space limit.

Assuming the removal of the vdso does not cause the application to seg
fault, then the user visible change is that any vdso call after a failed
mmap(MAP_FIXED) call would result in a seg fault. The only reason it
would fail is if the mapping process was attempting to map a large
enough area over the vdso (which is accounted and in the vma tree,
afaict) and ran out of memory. Note that this situation could arise
already since we could run out of memory (not accounting) after the
arch_unmap() call within the kernel.

The code today can suffer the same fate, but not by the accounting
failure. It can happen due to failure to allocate a new vma,
do_vmi_munmap() failure after the arch_unmap() call, or any of the other
failure scenarios later in the mmap_region() function.

At the very least, this requires an expanded change log.

>
> > /*
> > * Private writable mapping: check memory availability
> > */
> > --
> > 2.43.0
> >
>
> Looks good to me generally,
>
> Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx>