Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP

From: Lorenzo Stoakes (Oracle)

Date: Thu Apr 02 2026 - 10:24:33 EST


On Thu, Apr 02, 2026 at 03:55:27PM +0200, David Hildenbrand (Arm) wrote:
> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Apr 02, 2026 at 07:13:32PM +0800, Qi Tang wrote:
> >> prctl_set_mm_map() allows modifying all mm_struct boundaries and
> >> the saved auxv vector. The individual field path (PR_SET_MM_START_CODE
> >> etc.) correctly requires CAP_SYS_RESOURCE, but the PR_SET_MM_MAP path
> >> dispatches before this check and has no capability requirement of its
> >> own when exe_fd is -1.
> >>
> >> This means any unprivileged user on a CONFIG_CHECKPOINT_RESTORE kernel
> >> (nearly all distros) can rewrite mm boundaries including start_brk, brk,
> >> arg_start/end, env_start/end and saved_auxv. Consequences include:
> >>
> >> - SELinux PROCESS__EXECHEAP bypass via start_brk manipulation
> >> - procfs info disclosure by pointing arg/env ranges at other memory
> >> - auxv poisoning (AT_SYSINFO_EHDR, AT_BASE, AT_ENTRY)
> >>
> >> The original commit f606b77f1a9e ("prctl: PR_SET_MM -- introduce
> >> PR_SET_MM_MAP operation") states "we require the caller to be at least
> >> user-namespace root user", but this was never enforced in the code.
> >>
> >> Add a checkpoint_restore_ns_capable() check at the top of
> >> prctl_set_mm_map(), after the PR_SET_MM_MAP_SIZE early return. This
> >> requires CAP_CHECKPOINT_RESTORE or CAP_SYS_ADMIN in the caller's
> >> user namespace, matching the stated design intent and the existing
> >> check for exe_fd changes.
> >>
> >> Fixes: f606b77f1a9e ("prctl: PR_SET_MM -- introduce PR_SET_MM_MAP operation")
> >
> > We've had a gaping security hole since 2014 and nobody noticed? I find it
> > hard to believe.
> >
> >> Cc: stable@xxxxxxxxxxxxxxx
> >> Cc: Cyrill Gorcunov <gorcunov@xxxxxxxxxx>
> >> Signed-off-by: Qi Tang <tpluszz77@xxxxxxxxx>
> >> ---
> >> kernel/sys.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index c86eba9aa7e9..2b8c57f23a35 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -2071,6 +2071,9 @@ static int prctl_set_mm_map(int opt, const void __user *addr, unsigned long data
> >>  		return put_user((unsigned int)sizeof(prctl_map),
> >>  				(unsigned int __user *)addr);
> >>
> >> +	if (!checkpoint_restore_ns_capable(current_user_ns()))
> >> +		return -EPERM;
> >
> > Hmm there is already:
> >
> > 	if (prctl_map.exe_fd != (u32)-1) {
> > 		/*
> > 		 * Check if the current user is checkpoint/restore capable.
> > 		 * At the time of this writing, it checks for CAP_SYS_ADMIN
> > 		 * or CAP_CHECKPOINT_RESTORE.
> > 		 * Note that a user with access to ptrace can masquerade an
> > 		 * arbitrary program as any executable, even setuid ones.
> > 		 * This may have implications in the tomoyo subsystem.
> > 		 */
> > 		if (!checkpoint_restore_ns_capable(current_user_ns()))
> > 			return -EPERM;
> >
> > And you're proposing _adding_ this check on top of that? Seems super
> > redundant.
>
> Yes, should be moved.

Well, I don't think this patch should be applied at all...

>
> >
> > but also, this seems super-specific buuut... Then again #ifdef
> > CONFIG_CHECKPOINT_RESTORE around this. Ugh.
> >
> > I _hate_ this interface. HATE HATE HATE it.
> >
> > Anyway, does updating _your own_ auxv really require elevated permissions
> > like this?
> >
> > I don't think so? Couldn't you go and manipulate that anyway without
> > elevated anything?
>
> Hard to believe ...
>
> I was wondering whether this could break some users. At least CRIU doc
> states:
>
> This option tells *criu* to accept the limitations when running
> as non-root. Running as non-root requires *criu* at least to have
> *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
> running *criu* as non-root please consult the *NON-ROOT* section.

Hmm. I wonder if we don't have more users than that though? Hard to rule out
some weird program somewhere using it for some strange reason.

Commit ebd6de681238 ("prctl: Allow local CAP_CHECKPOINT_RESTORE to change
/proc/self/exe") explicitly _only_ restricted the exe link.

So maybe this comment refers to operations _other_ than a non-exe-changing
PR_SET_MM_MAP or PR_SET_MM_MAP_SIZE?

>
> I mean, the check makes sense given that prctl_set_mm() rejects all
> these operations without CAP_SYS_RESOURCE.

Hmm but the CAP_SYS_RESOURCE check is only applicable to commands other than
PR_SET_MM_MAP or PR_SET_MM_MAP_SIZE?

#ifdef CONFIG_CHECKPOINT_RESTORE
	if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
		return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
#endif

	if (!capable(CAP_SYS_RESOURCE))
		return -EPERM;

... rest ...
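
To spell out the ordering I mean, here is a rough user-space model of the
checks as quoted in this thread. It is a sketch, not kernel code: every
model_* / MODEL_* name is made up for illustration, it assumes
CONFIG_CHECKPOINT_RESTORE is enabled, and it ignores all of the address
validation the real path performs.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative user-space model only; none of these names exist in the kernel. */
enum model_opt { MODEL_START_BRK, MODEL_MM_MAP, MODEL_MM_MAP_SIZE };

struct model_caller {
	bool cap_sys_resource;	/* stands in for capable(CAP_SYS_RESOURCE) */
	bool cr_capable;	/* stands in for checkpoint_restore_ns_capable() */
};

#define MODEL_NO_EXE_FD ((uint32_t)-1)

/* Models prctl_set_mm_map(): only an exe_fd change is capability-gated. */
static int model_set_mm_map(const struct model_caller *c, enum model_opt opt,
			    uint32_t exe_fd)
{
	if (opt == MODEL_MM_MAP_SIZE)
		return 0;	/* the size query returns early */
	if (exe_fd != MODEL_NO_EXE_FD && !c->cr_capable)
		return -EPERM;
	return 0;
}

/* Models prctl_set_mm(): the map ops dispatch before the CAP_SYS_RESOURCE check. */
static int model_set_mm(const struct model_caller *c, enum model_opt opt,
			uint32_t exe_fd)
{
	if (opt == MODEL_MM_MAP || opt == MODEL_MM_MAP_SIZE)
		return model_set_mm_map(c, opt, exe_fd);
	if (!c->cap_sys_resource)
		return -EPERM;
	return 0;
}

/* Walks a caller with no capabilities through the three paths. */
void model_selfcheck(void)
{
	const struct model_caller unpriv = { false, false };

	/* Individual field path: rejected without CAP_SYS_RESOURCE. */
	assert(model_set_mm(&unpriv, MODEL_START_BRK, MODEL_NO_EXE_FD) == -EPERM);
	/* Map path with exe_fd == -1: no capability check is reached. */
	assert(model_set_mm(&unpriv, MODEL_MM_MAP, MODEL_NO_EXE_FD) == 0);
	/* Map path that changes exe_fd (3 is an arbitrary fd): rejected. */
	assert(model_set_mm(&unpriv, MODEL_MM_MAP, 3) == -EPERM);
}
```

So in this model the only capability-gated part of the map path is the exe_fd
change, which is what ebd6de681238 set out to restrict.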

>
>
> CAP_CHECKPOINT_RESTORE was not introduced before
>
> commit 124ea650d3072b005457faed69909221c2905a1f
> Author: Adrian Reber <areber@xxxxxxxxxx>
> Date: Sun Jul 19 12:04:11 2020 +0200
>
> capabilities: Introduce CAP_CHECKPOINT_RESTORE
>
> So at the time PR_SET_MM_MAP was added there simply was no such capability.
>
> Likely, now that we have it, we should indeed use it.

But we did start using it in the exe_fd != -1 case?

Hmm actually sorry it does more than just manipulating auxv, you can change a
bunch of mm->... stuff.

But if it's your process does it really matter? You can manipulate memory all
over the place in your process...

>
> --
> Cheers,
>
> David

Thanks, Lorenzo