[GIT PULL] checkpoint/restore changes for v5.9

From: Christian Brauner
Date: Tue Aug 04 2020 - 07:32:16 EST

Hi Linus,

/* Summary */
This pull request contains the changes to enable unprivileged
checkpoint/restore of processes. Given that this work has been going on for
quite some time the first sentence in this summary is hopefully more exciting
than the actual final code changes required.
Unprivileged checkpoint/restore has seen a frequent increase in interest over
the last two years and has thus been one of the main topics for the combined
containers & checkpoint/restore microconference since at least 2018 (cf. [1]).

Here are just the three most frequent use-cases that were brought forward:
1. The JVM developers are integrating checkpoint/restore into a Java VM to
significantly decrease the startup time.
2. In high-performance computing environment a resource manager will typically
be distributing jobs where users are always running as non-root.
Long-running and "large" processes with significant startup times are
supposed to be checkpointed and restored with CRIU.
3. Container migration as a non-root user.

In all of these scenarios it is either desirable or required to run without
CAP_SYS_ADMIN. The userspace implementation of checkpoint/restore CRIU already
has the pull request for supporting unprivileged checkpoint/restore up (cf. [2]).

To enable unprivileged checkpoint/restore a new dedicated capability
CAP_CHECKPOINT_RESTORE is introduced. This solution has last been discussed in
2019 in a talk by Google at Linux Plumbers (cf. [1] "Update on Task Migration
at Google Using CRIU") with Adrian and Nicolas providing the implementation now
over the last months. In essence, this allows the CRIU binary to be installed
with the CAP_CHECKPOINT_RESTORE vfs capability set thereby enabling
unprivileged users to restore processes.

To make this possible the following permissions are altered:
1. Selecting a specific PID via clone3() set_tid
2. Selecting a specific PID via /proc/sys/kernel/ns_last_pid
3. Accessing /proc/pid/map_files
- Relaxed from init userns CAP_SYS_ADMIN to init userns
4. Changing /proc/self/exe
- Relaxed from userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

Of these four changes the /proc/self/exe change deserves a few words because
the reasoning behind even restricting /proc/self/exe changes in the first place
is just full of historical quirks and tracking this down was a questionable
version of fun that I'd like to spare others.
In short, it is trivial to change /proc/self/exe as an unprivileged user, i.e.
without userns CAP_SYS_ADMIN right now. Either via ptrace() or by simply
intercepting the elf loader in userspace during exec. Nicolas was nice enough
to even provide a POC for the latter (cf. [3]) to illustrate this fact.
The original patchset which introduced PR_SET_MM_MAP had no permissions around
changing the exe link. They too argued that it is trivial to spoof the exe link
already which is true. The argument brought up against this was that the Tomoyo
LSM uses the exe link in tomoyo_manager() to detect whether the calling process
is a policy manager. This caused changing the exe links to be guarded by userns
CAP_SYS_ADMIN. All in all this rather seems like a "better guard it with
something rather than nothing" argument which imho doesn't qualify as a great
security policy. Again, because spoofing the exe link is possible for the
calling process so even if this were security relevant it was broken back then
and would be broken today. So technically, dropping all permissions around
changing the exe link would probably be possible and would send a clearer
message to any userspace that relies on /proc/self/exe for security reasons
that they should stop doing this but for now we're only relaxing the exe link
permissions from userns CAP_SYS_ADMIN to userns CAP_CHECKPOINT_RESTORE.

There's a final uapi change in here. Changing the exe link used to accidently
return EINVAL when the caller lacked the necessary permissions instead of
the more correct EPERM. This pr contains a commit fixing this. I assume that
userspace won't notice or care and if they do I will revert this commit. But
since we are changing the permissions anyway it seems like a good opportunity
to try this fix.

With these changes merged unprivileged checkpoint/restore will be possible and
has already been tested by various users.

[1]: LPC 2018
1. "Task Migration at Google Using CRIU"
2. "Securely Migrating Untrusted Workloads with CRIU"
LPC 2019
1. "CRIU and the PID dance"
2. "Update on Task Migration at Google Using CRIU"
[2]: https://github.com/checkpoint-restore/criu/pull/1155
[3]: https://github.com/nviennot/run_as_exe

/* Testing */
All patches are based on v5.8-rc5 and have been sitting in linux-next. No build
failures or warnings were observed. All old and new tests are passing.

/* Conflicts */
At the time of creating this PR no merge conflicts were reported from
linux-next and no merge conflicts showed up doing a test-merge with current
mainline bcf876870b95 ("Linux 5.8").

The following changes since commit 11ba468877bb23f28956a35e896356252d63c983:

Linux 5.8-rc5 (2020-07-12 16:34:50 -0700)

are available in the Git repository at:

git@xxxxxxxxxxxxxxxxxxx:pub/scm/linux/kernel/git/brauner/linux tags/cap-checkpoint-restore-v5.9

for you to fetch changes up to 1d27a0be16d6c95fd71deef34e94b40cb4411cc9:

selftests: add clone3() CAP_CHECKPOINT_RESTORE test (2020-07-20 15:47:42 +0200)

Please consider pulling these changes from the signed cap-checkpoint-restore-v5.9 tag.



Adrian Reber (5):
capabilities: Introduce CAP_CHECKPOINT_RESTORE
pid: use checkpoint_restore_ns_capable() for set_tid
pid_namespace: use checkpoint_restore_ns_capable() for ns_last_pid
proc: allow access in init userns for map_files with CAP_CHECKPOINT_RESTORE
selftests: add clone3() CAP_CHECKPOINT_RESTORE test

Nicolas Viennot (2):
prctl: Allow local CAP_CHECKPOINT_RESTORE to change /proc/self/exe
prctl: exe link permission error changed from -EINVAL to -EPERM

fs/proc/base.c | 8 +-
include/linux/capability.h | 6 +
include/uapi/linux/capability.h | 9 +-
kernel/pid.c | 2 +-
kernel/pid_namespace.c | 2 +-
kernel/sys.c | 13 +-
security/selinux/include/classmap.h | 5 +-
tools/testing/selftests/clone3/.gitignore | 1 +
tools/testing/selftests/clone3/Makefile | 4 +-
.../clone3/clone3_cap_checkpoint_restore.c | 182 +++++++++++++++++++++
10 files changed, 217 insertions(+), 15 deletions(-)
create mode 100644 tools/testing/selftests/clone3/clone3_cap_checkpoint_restore.c