Re: [PATCH] prctl: require checkpoint_restore_ns_capable for PR_SET_MM_MAP

From: Andrei Vagin

Date: Thu Apr 02 2026 - 13:48:40 EST


On Thu, Apr 2, 2026 at 7:29 AM David Hildenbrand (Arm) <david@xxxxxxxxxx> wrote:
>
> On 4/2/26 16:21, Lorenzo Stoakes (Oracle) wrote:
> > On Thu, Apr 02, 2026 at 03:55:27PM +0200, David Hildenbrand (Arm) wrote:
> >> On 4/2/26 15:06, Lorenzo Stoakes (Oracle) wrote:
> >>>
> >>> We've had a gaping security hole since 2014 and nobody noticed? I find it
> >>> hard to believe.
> >>>
> >>>
> >>> Hmm there is already:
> >>>
> >>> if (prctl_map.exe_fd != (u32)-1) {
> >>> /*
> >>> * Check if the current user is checkpoint/restore capable.
> >>> * At the time of this writing, it checks for CAP_SYS_ADMIN
> >>> * or CAP_CHECKPOINT_RESTORE.
> >>> * Note that a user with access to ptrace can masquerade an
> >>> * arbitrary program as any executable, even setuid ones.
> >>> * This may have implications in the tomoyo subsystem.
> >>> */
> >>> if (!checkpoint_restore_ns_capable(current_user_ns()))
> >>> return -EPERM;
> >>>
> >>> And you're proposing _adding_ this check on top of that? Seems super
> >>> redundant.
> >>
> >> Yes, should be moved.
> >
> > Well, I don't think this patch should be applied at all...
> >
>
> I mean a v2 would have to do that. Whether we would merge that is
> another discussion :)
>
> >>
> >>>
> >>> but also, this seems super-specific buuut... Then again #ifdef
> >>> CONFIG_CHECKPOINT_RESTORE around this. Ugh.
> >>>
> >>> I _hate_ this inteface. HATE HATE HATE it.
> >>>
> >>> Anyway, does updating _your own_ auxv really require elevated permissions
> >>> like this?
> >>>
> >>> I don't think so? Couldn't you go and manipulate that anyway without
> >>> elevated anything?
> >>
> >> Hard to believe ...
> >>
> >> I was wondering whether this could break some users. At least CRIU doc
> >> states:
> >>
> >> This option tells *criu* to accept the limitations when running
> >> as non-root. Running as non-root requires *criu* at least to have
> >> *CAP_SYS_ADMIN* or *CAP_CHECKPOINT_RESTORE*. For details about
> >> running *criu* as non-root please consult the *NON-ROOT* section.
> >
> > Hmm. I wonder if we don't have more users than that though? Hard to rule out
> > some weird program somewhere using it for some strange reason.
>
> See my LXC example. My gut feeling is that there are more users.
>
> Which then raises the question why this is still protected by that
> kconfig option.
>
> Something is off here, maybe :)
>
> >
> > Commit ebd6de681238 ("prctl: Allow local CAP_CHECKPOINT_RESTORE to change
> > /proc/self/exe") explicitly _only_ restricted the exe link.
> >
> > So maybe these comment is in reference to _other_ operations other than non-exe
> > changing PR_SET_MM_MAP, PR_SET_MM_MAP_SIZE?
> >
> >>
> >> I mean, the check makes sense given that prctl_set_mm() rejects all
> >> these operations without CAP_SYS_RESOURCE.
> >
> > Hmm but the CAP_SYS_RESOURCE check is only applicable to commands other than
> > PR_SET_MM_MAP or PR_SET_MM_MAP_SIZE?
> >
> > #ifdef CONFIG_CHECKPOINT_RESTORE
> > if (opt == PR_SET_MM_MAP || opt == PR_SET_MM_MAP_SIZE)
> > return prctl_set_mm_map(opt, (const void __user *)addr, arg4);
> > #endif
> >
> > if (!capable(CAP_SYS_RESOURCE))
> > return -EPERM;
> >
> > ... rest ...
>
> My point is that you can perform all these modifications without
> CAP_SYS_RESOURCE through prctl_set_mm_map().
>
> Like PR_SET_MM_AUXV.
>
> It's all very inconsistent, that's what I am saying.
>
> >
> >>
> >>
> >> CAP_CHECKPOINT_RESTORE was not introduced before
> >>
> >> commit 124ea650d3072b005457faed69909221c2905a1f
> >> Author: Adrian Reber <areber@xxxxxxxxxx>
> >> Date: Sun Jul 19 12:04:11 2020 +0200
> >>
> >> capabilities: Introduce CAP_CHECKPOINT_RESTORE
> >>
> >> So at the time PR_SET_MM_MAP was added there simply was no such capability.
> >>
> >> Likely, now that we have it, we should indeed use it.
> >
> > But we did start using it in the exec_fd != -1 case?
>
> The existing ns check was replaced at some point, yes.
>
> >
> > Hmm actually sorry it does more than just manipulating auxv, you can change a
> > bunch of mm->... stuff.
> >
> > But if it's your process does it really matter? You can manipulate memory all
> > over the place in your process...
>
> Well, I am wondering why e.g., PR_SET_MM_AUXV etc requires CAP_SYS_RESOURCE.
>
> PR_SET_MM_EXE_FILE I understand. The other not.
>
> Extremely inconsistent.

It was a long time ago, and I was not directly involved in introducing
PR_SET_MM_MAP. As far as I remember, the PR_SET_MM_{START,END}_ATTR
commands were introduced first, and they were guarded by the global
CAP_SYS_RESOURCE capability. I believe these prctl calls were part of
the first series of patches for user-process Checkpoint/Restore (C/R)
merged into the Linux kernel.

A few years later, as real users started adopting CRIU, there was a demand
to support user namespaces. There was a discussion about relaxing the
global CAP_SYS_RESOURCE check. Eric expressed
concern that this interface didn’t guarantee the consistency of these
parameters. That was the moment PR_SET_MM_MAP was introduced. I dug into
the history and found the relevant discussion:
https://lkml.iu.edu/hypermail/linux/kernel/1407.1/01621.html
https://lkml.iu.edu/hypermail/linux/kernel/1407.1/00899.html.

```
Subject: prctl: PR_SET_MM -- Introduce PR_SET_MM_MAP operation

During development of c/r we've noticed that in case if we need to
support user namespaces we face a problem with capabilities in
prctl(PR_SET_MM, ...) call, in particular once new user namespace
is created capable(CAP_SYS_RESOURCE) no longer passes.

A approach is to eliminate CAP_SYS_RESOURCE check but pass all
new values in one bundle, which would allow the kernel to make
more intensive test for sanity of values and same time allow us to
support checkpoint/restore of user namespaces.
```

The initial implementation of PR_SET_MM_MAP didn't have the capability
check. It was introduced in 4d28df6152aa ("prctl: Allow local
CAP_SYS_ADMIN changing exe_file").

Thanks,
Andrei