Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP

From: David Hildenbrand (Arm)

Date: Thu Apr 02 2026 - 10:13:38 EST


On 4/2/26 15:55, David Hildenbrand (Arm) wrote:
> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
>> On Thu, Apr 02, 2026 at 07:13:32PM +0800, Qi Tang wrote:
>>> prctl_set_mm_map() allows modifying all mm_struct boundaries and
>>> the saved auxv vector. The individual field path (PR_SET_MM_START_CODE
>>> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
>>> dispatches before this check and has no capability requirement of its
>>> own when exe_fd is -1.
>>>
>>> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
>>> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
>>> arg_start/end, env_start/end and saved_auxv. Consequences include:
>>>
>>> - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
>>> - procfs info disclosure by pointing arg/env ranges at other memory
>>> - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
>>>
>>> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
>>> PR_SET_MM_MAP operation") states "we require the caller to be at least
>>> user-namespace root user", but this was never enforced in the code.
>>>
>>> Add a checkpoint_restore_ns_capable() check at the top of
>>> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return. This
>>> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
>>> user namespace, matching the stated design intent and the existing
>>> check for exe_fd changes.
>>>
>>> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
>>
>> We've had a gaping security hole since 2014 and nobody noticed? I find it
>> hard to believe.
>>
>>> Cc: stable@xxxxxxxxxxxxxxx
>>> Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
>>> Signed-off-by: Qi Tang <tpluszz77@xxxxxxxxx>
>>> ---
>>> kernel/sys.c | 3 +++
>>> 1 file changed, 3 insertions(+)
>>>
>>> diff --git a/kernel/sys.c b/kernel/sys.c
>>> index c86eba9aa7e9..2b8c57f23a35 100644
>>> --- a/kernel/sys.c
>>> +++ b/kernel/sys.c
>>> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
>>>  		return put_user((unsigned int)sizeof(prctl_map),
>>>  				(unsigned int __user *)addr);
>>>
>>> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
>>> +		return -EPERM;
>>
>> Hmm there is already:
>>
>> 	if (prctl_map.exe_fd != (u32)-1) {
>> 		/*
>> 		 * Check if the current user is checkpoint/restore capable.
>> 		 * At the time of this writing, it checks for CAP_SYS_ADMIN
>> 		 * or CAP_CHECKPOINT_RESTORE.
>> 		 * Note that a user with access to ptrace can masquerade an
>> 		 * arbitrary program as any executable, even setuid ones.
>> 		 * This may have implications in the tomoyo subsystem.
>> 		 */
>> 		if (!checkpoint_restore_ns_capable(current_user_ns()))
>> 			return -EPERM;
>>
>> And you're proposing _adding_ this check on top of that? Seems super
>> redundant.
>
> Yes, should be moved.
>
>>
>> but also, this seems super-specific buuut... Then again #ifdef
>> CONFIG_CHECKPOINT_RESTORE around this. Ugh.
>>
>> I _hate_ this interface. HATE HATE HATE it.
>>
>> Anyway, does updating _your own_ auxv really require elevated permissions
>> like this?
>>
>> I don't think so? Couldn't you go and manipulate that anyway without
>> elevated anything?
>
> Hard to believe ...
>
> I was wondering whether this could break some users. At least CRIU doc
> states:
>
> This option tells *criu* to accept the limitations when running
> as non-root. Running as non-root requires *criu* at least to have
> *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
> running *criu* as non-root please consult the *NON-ROOT* section.

Doing some digging, lxc seems to use that interface.

https://github.com/lxc/lxc/blob/3ee89c5d95ee8f31bd81623fd73ad7beea4297f8/src/lxc/initutils.c#L311

I have no clue about capabilities there.
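
One way a userspace caller such as lxc could find out whether it would pass the proposed check is to look at its own effective capability set. Below is a minimal sketch, assuming the check boils down to "CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN" as the quoted kernel comment says. The helper names are hypothetical, and the sketch only inspects the CapEff mask from /proc/self/status; it does not model which user namespace the kernel evaluates the capability against.

```c
#include <stdint.h>
#include <stdio.h>

/* Capability numbers from <linux/capability.h>. */
#define CAP_SYS_ADMIN_BIT          21
#define CAP_CHECKPOINT_RESTORE_BIT 40

/*
 * Hypothetical helper: returns 1 if the given effective-capability
 * mask contains CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN, else 0.
 */
int mask_is_cr_capable(uint64_t cap_eff)
{
	return ((cap_eff >> CAP_CHECKPOINT_RESTORE_BIT) & 1) ||
	       ((cap_eff >> CAP_SYS_ADMIN_BIT) & 1);
}

/*
 * Hypothetical helper: read the calling process' effective capability
 * mask from the "CapEff:" line of /proc/self/status.
 * Returns 0 on success, -1 if the file or the line is unavailable.
 */
int read_cap_eff(uint64_t *out)
{
	FILE *f = fopen("/proc/self/status", "r");
	char line[256];
	int ret = -1;

	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		unsigned long long v;

		/* The field is a hex mask, e.g. "CapEff:\t000001ffffffffff". */
		if (sscanf(line, "CapEff: %llx", &v) == 1) {
			*out = (uint64_t)v;
			ret = 0;
			break;
		}
	}

	fclose(f);
	return ret;
}
```

A caller would run read_cap_eff() and pass the result to mask_is_cr_capable() before attempting PR_SET_MM_MAP, falling back to some other path if it returns 0. Whether lxc actually holds either capability at that call site is not something this sketch answers.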

--
Cheers,

David