Re: [PATCH 0/2] Revert "fs/exec: allow to unshare a time namespace on vfork+exec"

From: Andrei Vagin
Date: Tue Sep 13 2022 - 13:42:37 EST


On Tue, Sep 13, 2022 at 5:18 AM Kees Cook <keescook@xxxxxxxxxxxx> wrote:
>
> On Tue, Sep 13, 2022 at 03:25:49AM -0700, Andrei Vagin wrote:
> > This reverts commits:
> > 133e2d3e81de ("fs/exec: allow to unshare a time namespace on vfork+exec")
> > 6342140db660 ("selftests/timens: add a test for vfork+exit")
> >
> > Alexey pointed out a few undesirable side effects of the reverted change.
> > First, it doesn't take into account that CLONE_VFORK can be used with
> > CLONE_THREAD. Second, a child process doesn't enter a target time name-space,
> > if its parent dies before the child calls exec. It happens because the parent
> > clears vfork_done.
> >
> > Eric W. Biederman suggests installing a time namespace as a task gets a new mm.
> > It includes all new processes cloned without CLONE_VM and all tasks that call
> > exec(). This is an user API change, but we think there aren't users that depend
> > on the old behavior.
>
> Can we include that patch here as well?

It is attached. I need to test it and then I will send it properly.

>
> > It is too late to make such changes in this release, so let's roll back
> > this patch and introduce the right one in the next release.
>
> Do you mean you'd like this revert to land for v6.0, and we should wait
> for the new API for later?

Yes, I mean this. I think we should merge the new patch in v6.1-rc1 so
it sits there for a while.

Thanks,
Andrei
From 23e06b14354b328a100229c2d5c66b1d1ffc7a6f Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@xxxxxxxxx>
Date: Tue, 13 Sep 2022 14:10:28 +0000
Subject: [PATCH] fs/exec: switch timens when a task gets a new mm

Changing a time namespace requires remapping a vvar page, so we don't want to
allow doing that if any other tasks can use the same mm.

Currently, we install a time namespace when a task is cloned without CLONE_VM.
exec() is another case when a task gets a new mm and so it can switch a
time namespace safely, but this case isn't handled now.

One more issue of the current interface is that clone() with CLONE_VM isn't
allowed if the current task has unshared a time namespace (timens_for_children
doesn't match the current timens).

Both these issues make some inconvenience for users. For example, Florian
reported that posix_spawn() uses vfork+exec and this pattern doesn't work well
with time namespaces due to the both described issues. LXC needed to workaround
the exec() issue by calling setns.

In 133e2d3e81de5 "fs/exec: allow to unshare a time namespace on vfork+exec", we
tried to fix these issues with minimal impact on UAPI. But it adds extra
complexity and some undesirable side effects. Eric suggested fixing the issues
properly because here are all the reasons to suppose that there are no users
that depend on the old behavior.

Cc: Alexey Izbyshev <izbyshev@xxxxxxxxx>
Cc: Christian Brauner <brauner@xxxxxxxxxx>
Cc: Dmitry Safonov <0x7f454c46@xxxxxxxxx>
Cc: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Cc: Florian Weimer <fweimer@xxxxxxxxxx>
Cc: Kees Cook <keescook@xxxxxxxxxxxx>
Suggested-by: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Original-author: "Eric W. Biederman" <ebiederm@xxxxxxxxxxxx>
Signed-off-by: Andrei Vagin <avagin@xxxxxxxxx>
---
fs/exec.c | 5 +++++
include/linux/nsproxy.h | 1 +
kernel/fork.c | 9 ---------
kernel/nsproxy.c | 23 +++++++++++++++++++++--
4 files changed, 27 insertions(+), 11 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index d046dbb9cbd0..71284188b96d 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -65,6 +65,7 @@
#include <linux/io_uring.h>
#include <linux/syscall_user_dispatch.h>
#include <linux/coredump.h>
+#include <linux/time_namespace.h>

#include <linux/uaccess.h>
#include <asm/mmu_context.h>
@@ -1296,6 +1297,10 @@ int begin_new_exec(struct linux_binprm * bprm)

bprm->mm = NULL;

+ retval = exec_task_namespaces();
+ if (retval)
+ goto out_unlock;
+
#ifdef CONFIG_POSIX_TIMERS
spin_lock_irq(&me->sighand->siglock);
posix_cpu_timers_exit(me);
diff --git a/include/linux/nsproxy.h b/include/linux/nsproxy.h
index cdb171efc7cb..fee881cded01 100644
--- a/include/linux/nsproxy.h
+++ b/include/linux/nsproxy.h
@@ -94,6 +94,7 @@ static inline struct cred *nsset_cred(struct nsset *set)
int copy_namespaces(unsigned long flags, struct task_struct *tsk);
void exit_task_namespaces(struct task_struct *tsk);
void switch_task_namespaces(struct task_struct *tsk, struct nsproxy *new);
+int exec_task_namespaces(void);
void free_nsproxy(struct nsproxy *ns);
int unshare_nsproxy_namespaces(unsigned long, struct nsproxy **,
struct cred *, struct fs_struct *);
diff --git a/kernel/fork.c b/kernel/fork.c
index 2b6bd511c6ed..4eb803f75225 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2044,15 +2044,6 @@ static __latent_entropy struct task_struct *copy_process(
return ERR_PTR(-EINVAL);
}

- /*
- * If the new process will be in a different time namespace
- * do not allow it to share VM or a thread group with the forking task.
- */
- if (clone_flags & (CLONE_THREAD | CLONE_VM)) {
- if (nsp->time_ns != nsp->time_ns_for_children)
- return ERR_PTR(-EINVAL);
- }
-
if (clone_flags & CLONE_PIDFD) {
/*
* - CLONE_DETACHED is blocked so that we can potentially
diff --git a/kernel/nsproxy.c b/kernel/nsproxy.c
index eec72ca962e2..a487ff24129b 100644
--- a/kernel/nsproxy.c
+++ b/kernel/nsproxy.c
@@ -157,7 +157,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (likely(!(flags & (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
CLONE_NEWPID | CLONE_NEWNET |
CLONE_NEWCGROUP | CLONE_NEWTIME)))) {
- if (likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
+ if ((flags & CLONE_VM) ||
+ likely(old_ns->time_ns_for_children == old_ns->time_ns)) {
get_nsproxy(old_ns);
return 0;
}
@@ -179,7 +180,8 @@ int copy_namespaces(unsigned long flags, struct task_struct *tsk)
if (IS_ERR(new_ns))
return PTR_ERR(new_ns);

- timens_on_fork(new_ns, tsk);
+ if ((flags & CLONE_VM) == 0)
+ timens_on_fork(new_ns, tsk);

tsk->nsproxy = new_ns;
return 0;
@@ -254,6 +256,23 @@ void exit_task_namespaces(struct task_struct *p)
switch_task_namespaces(p, NULL);
}

+int exec_task_namespaces(void)
+{
+ struct task_struct *tsk = current;
+ struct nsproxy *new;
+
+ if (tsk->nsproxy->time_ns_for_children == tsk->nsproxy->time_ns)
+ return 0;
+
+ new = create_new_namespaces(0, tsk, current_user_ns(), tsk->fs);
+ if (IS_ERR(new))
+ return PTR_ERR(new);
+
+ timens_on_fork(new, tsk);
+ switch_task_namespaces(tsk, new);
+ return 0;
+}
+
static int check_setns_flags(unsigned long flags)
{
if (!flags || (flags & ~(CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC |
--
2.37.2.789.g6183377224-goog