Re: [PATCH v3 16/16] mm/mmap: Move may_expand_vm() check in mmap_region()
From: LEROY Christophe
Date: Wed Jul 10 2024 - 08:59:57 EST
Le 10/07/2024 à 14:45, Lorenzo Stoakes a écrit :
> On Wed, Jul 10, 2024 at 10:28:01PM GMT, Michael Ellerman wrote:
>> Lorenzo Stoakes <lorenzo.stoakes@xxxxxxxxxx> writes:
>>> On Mon, Jul 08, 2024 at 04:43:15PM GMT, Liam R. Howlett wrote:
>>>>
>> ...
>>>> The functionality here has changed
>>>> --- from ---
>>>> may_expand_vm() check
>>>> can_modify_mm() check
>>>> arch_unmap()
>>>> vms_gather_munmap_vmas()
>>>> ...
>>>>
>>>> --- to ---
>>>> can_modify_mm() check
>>>> arch_unmap()
>>>> vms_gather_munmap_vmas()
>>>> may_expand_vm() check
>>>> ...
>>>>
>>>> vms_gather_munmap_vmas() does nothing but figures out what to do later,
>>>> but could use memory and can fail.
>>>>
>>>> The user implications are:
>>>>
>>>> 1. The return type on the error may change to -EPERM from -ENOMEM, if
>>>> you are not allowed to expand and are trying to overwrite mseal()'ed
>>>> VMAs. That seems so very rare that I'm not sure it's worth mentioning.
>>>>
>>>>
>>>> 2. arch_unmap() called prior to may_expand_vm().
>>>> powerpc uses this to set mm->context.vdso = NULL if mm->context.vdso is
>>>> within the unmap range. User implication of this means that an
>>>> application my set the vdso to NULL prior to hitting the -ENOMEM case in
>>>> may_expand_vm() due to the address space limit.
>>>>
>>>> Assuming the removal of the vdso does not cause the application to seg
>>>> fault, then the user visible change is that any vdso call after a failed
>>>> mmap(MAP_FIXED) call would result in a seg fault. The only reason it
>>>> would fail is if the mapping process was attempting to map a large
>>>> enough area over the vdso (which is accounted and in the vma tree,
>>>> afaict) and ran out of memory. Note that this situation could arise
>>>> already since we could run out of memory (not accounting) after the
>>>> arch_unmap() call within the kernel.
>>>>
>>>> The code today can suffer the same fate, but not by the accounting
>>>> failure. It can happen due to failure to allocate a new vma,
>>>> do_vmi_munmap() failure after the arch_unmap() call, or any of the other
>>>> failure scenarios later in the mmap_region() function.
>>>>
>>>> At the very least, this requires an expanded change log.
>>>
>>> Indeed, also (as mentioned on IRC) I feel like we need to look at whether
>>> we _truly_ need this arch_unmap() call for a single, rather antiquated,
>>> architecture.
>>
>> You can call it "niche" or "irrelevant" or "fringe", but "antiquated" is
>> factually wrong :) Power10 came out of the fab just a few years ago at
>> 7nm.
>
> Fair point ;) perhaps we could go with "rarified"? :>)
>
>>
>>> I mean why are they unmapping the VDSO, why is that valid, why does it need
>>> that field to be set to NULL, is it possible to signify that in some other
>>> way etc.?
>>
>> It was originally for CRIU. So a niche workload on a niche architecture.
>>
>> But from the commit that added it, it sounds like CRIU was using mremap,
>> which should be handled these days by vdso_mremap(). So it could be that
>> arch_unmap() is not actually needed for CRIU anymore.
>
> Oh that's interesting!
>
>>
>> Then I guess we have to decide if removing our arch_unmap() would be an
>> ABI break, regardless of whether CRIU needs it or not.
>
> Seems to me like an internal implementation detail that should hopefully
> not result in anything that should have visible ABI impact?
>
> I guess this is something we ought to assess. It would be useful to
> eliminate hooks where we can so we can better control VMA behaviour without
> having to worry about an arch being able to do arbitrary things at
> unexpected times, especially pertinent where we change the order of things.
>
I see you are talking about arch_unmap(). I didn't follow the entire
discussion but we have some related stuff here:
https://github.com/linuxppc/issues/issues/241
If I remember correctly arch_unmap() should have gone away we Dmitry's
series
https://lore.kernel.org/lkml/20210611180242.711399-1-dima@xxxxxxxxxx/#r
but it hasn't been applied yet.
Christophe