Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
From: Lorenzo Stoakes
Date: Wed Jul 10 2024 - 08:46:34 EST
On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote:
> Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> writes:
> > On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
> >>
> ...
> >> The functionality here has changed
> >> --- from ---
> >> may_expand_vm() check
> >> can_modify_mm() check
> >> arch_unmap()
> >> vms_gather_munmap_vmas()
> >> ...
> >>
> >> --- to ---
> >> can_modify_mm() check
> >> arch_unmap()
> >> vms_gather_munmap_vmas()
> >> may_expand_vm() check
> >> ...
> >>
> >> vms_gather_munmap_vmas() does nothing but figures out what to do later,
> >> but could use memory and can fail.
> >>
> >> The user implications are:
> >>
> >> 1. The return type on the error may change to -EPERM from -ENOMEM, if
> >> you are not allowed to expand and are trying to overwrite mseal()'ed
> >> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
> >>
> >>
> >> 2. arch_unmap() called prior to may_expand_vm().
> >> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
> >> within the unmap range. User implication of this means that an
> >> application my set the vdso to NULL prior to hitting the -ENOMEM case in
> >> may_expand_vm() due to the address space limit.
> >>
> >> Assuming the removal of the vdso does not cause the application to seg
> >> fault, then the user visible change is that any vdso call after a failed
> >> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
> >> would fail is if the mapping process was attempting to map a large
> >> enough area over the vdso (which is accounted and in the vma tree,
> >> afaict) and ran out of memory. Note that this situation could arise
> >> already since we could run out of memory (not accounting) after the
> >> arch_unmap() call within the kernel.
> >>
> >> The code today can suffer the same fate, but not by the accounting
> >> failure. It can happen due to failure to allocate a new vma,
> >> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
> >> failure scenarios later in the mmap_region() function.
> >>
> >> At the very least, this requires an expanded change log.
> >
> > Indeed, also (as mentioned on IRC) I feel like we need to look at whether
> > we _truly_ need this arch_unmap() call for a single, rather antiquated,
> > architecture.
>
> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
> factually wrong :) Power10 came out of the fab just a few years ago at
> 7nm.
Fair point ;) perhaps we could go with "rarified"? :>)
>
> > I mean why are they unmapping the VDSO, why is that valid, why does it need
> > that field to be set to NULL, is it possible to signify that in some other
> > way etc.?
>
> It was originally for CRIU. So a niche workload on a niche architecture.
>
> But from the commit that added it, it sounds like CRIU was using mremap,
> which should be handled these days by vdso_mremap(). So it could be that
> arch_unmap() is not actually needed for CRIU anymore.
Oh that's interesting!
>
> Then I guess we have to decide if removing our arch_unmap() would be an
> ABI break, regardless of whether CRIU needs it or not.
Seems to me like an internal implementation detail that should hopefully
not result in anything that should have visible ABI impact?
I guess this is something we ought to assess. It would be useful to
eliminate hooks where we can so we can better control VMA behaviour without
having to worry about an arch being able to do arbitrary things at
unexpected times, especially pertinent where we change the order of things.
>
> cheers
Thanks for taking a look!