[PATCH 1/2] fork: add clone6

From: Christian Brauner
Date: Sun May 26 2019 - 06:31:58 EST


This adds the clone6 system call.

As mentioned several times already (cf. [7], [8]) here's the promised
patchset for clone6().

We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
free flag from clone().

Independent of the CLONE_PIDFD patchset a time namespace has been discussed
at Linux Plumber Conference last year and has been sent out and reviewed
(cf. [5]). It is expected that it will go upstream in the not too distant
future. However, it relies on the addition of the CLONE_NEWTIME flag to
clone(). The only other good candidate - CLONE_DETACHED - is currently not
recycable as we have identified at least two large or widely used codebases
that currently pass this flag (cf. [2], [3], and [4]). Given that we
grabbed the last clone() flag we effectively blocked the time namespace
patchset. It just seems right that we unblock it again.

The idea is to keep clone6() very simple and close to the original clone(),
specifically, to keep on supporting old clone()-based workloads.
We know there have been various creative proposals how a new process
creation syscall or even api is supposed to look like. Some people even
going so far as to argue that the traditional fork()+exec() split should be
abandoned in favor of an in-kernel version of spawn(). Independent of
whether or not we personally think spawn() is a good idea this patchset has
and does not want to have anything to do with this.
One stance we take is that there's no real good alternative to
clone()+exec() and we need and want to support this model going forward
independent of spawn().
The following requirements guided us for clone6():
- bump the number of available flags as much as possible while ensuring
that all flag arguments are passed in registers so they remain easily
accessible for seccomp.
- move non-flag arguments that are currently passed as separate arguments
in clone() into a dedicated struct
- choose a struct layout that is easy to handle on 32 and on 64 bit
- choose a struct layout that is extensible
- give new flags that currently need to abuse another flag's dedicated
return argument in clone() their own dedicated return argument
(e.g. CLONE_PIDFD)
- ease of transition for userspace from clone() to clone6()
- do not try to be clever or complex: keep clone6() as dumb as possible

What we came up with is clone6() which has the following signature:

struct clone6_args {
__s32 pidfd;
__aligned_u64 parent_tidptr;
__aligned_u64 child_tidptr;
__aligned_u64 stack;
__aligned_u64 stack_size;
__aligned_u64 tls;
};

long sys_clone6(struct clone6_args __user *uargs,
unsigned int flags1,
unsigned int flags2,
unsigned int flags3,
unsigned int flags4,
unsigned int flags5);

clone6() cleanly supports all of the supported flags from clone() in
flags1, i.e. flags1 is full and all legacy workloads are supported with
clone6().
With clone6() we have 160 flag values in total which - even for a feature
heavy syscall like clone - should hold quite a while. If they are really
all taken at some point we can simply bite the bullet and start adding
additional flag arguments into struct clone6 itself.

Another advantage of sticking close to the old clone() is the low cost for
userspace to switch to this new api. Quite a lot of userspace apis (e.g.
pthreads) are based on the clone() syscall. With the new clone6() syscall
supporting all of the old workloads and opening up the ability to add new
features should make switching to it for userspace more appealing. In
essence, glibc can just write a simple wrapper to switch from clone() to
clone6().

There has been some interest in this patchset already. We have received a
patch from the CRIU corner for clone6() that would set the PID/TID of a
restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.

/* References */
[1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
[2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
[3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
[4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
[5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@xxxxxxxxxx/
[6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@xxxxxxxxxx/
[7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@xxxxxxxxxxxxxx/
[8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@xxxxxxxxxx/

Co-developed-by: Jann Horn <jannh@xxxxxxxxxx>
Signed-off-by: Jann Horn <jannh@xxxxxxxxxx>
Signed-off-by: Christian Brauner <christian@xxxxxxxxxx>
Cc: Arnd Bergmann <arnd@xxxxxxxx>
Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
Cc: David Howells <dhowells@xxxxxxxxxx>
Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
Cc: Adrian Reber <adrian@xxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Andrei Vagin <avagin@xxxxxxxxx>
Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
Cc: Florian Weimer <fweimer@xxxxxxxxxx>
Cc: linux-api@xxxxxxxxxxxxxxx
---
arch/x86/ia32/sys_ia32.c | 9 ++-
include/linux/sched/task.h | 2 +-
include/linux/syscalls.h | 9 +++
include/uapi/linux/sched.h | 14 ++++
kernel/fork.c | 161 ++++++++++++++++++++++++++++---------
5 files changed, 152 insertions(+), 43 deletions(-)

diff --git a/arch/x86/ia32/sys_ia32.c b/arch/x86/ia32/sys_ia32.c
index a43212036257..55a8c550ba74 100644
--- a/arch/x86/ia32/sys_ia32.c
+++ b/arch/x86/ia32/sys_ia32.c
@@ -237,6 +237,11 @@ COMPAT_SYSCALL_DEFINE5(x86_clone, unsigned long, clone_flags,
unsigned long, newsp, int __user *, parent_tidptr,
unsigned long, tls_val, int __user *, child_tidptr)
{
- return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr,
- tls_val);
+ struct clone6_args args = {
+ .stack = newsp,
+ .parent_tidptr = (uintptr_t)parent_tidptr,
+ .tls = tls_val,
+ .child_tidptr = (uintptr_t)child_tidptr
+ };
+ return _do_fork(clone_flags, &args);
}
diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index f1227f2c38a4..06e7c0df6ab0 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -73,7 +73,7 @@ extern void do_group_exit(int);
extern void exit_files(struct task_struct *);
extern void exit_itimers(struct signal_struct *);

-extern long _do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *, unsigned long);
+extern long _do_fork(u64 clone_flags, struct clone6_args *uargs);
extern long do_fork(unsigned long, unsigned long, unsigned long, int __user *, int __user *);
struct task_struct *fork_idle(int);
struct mm_struct *copy_init_mm(void);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index e2870fe1be5b..235df5c5e711 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -70,6 +70,7 @@ struct sigaltstack;
struct rseq;
union bpf_attr;
struct io_uring_params;
+struct clone6_args;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -852,6 +853,14 @@ asmlinkage long sys_clone(unsigned long, unsigned long, int __user *,
int __user *, unsigned long);
#endif
#endif
+
+#ifdef __ARCH_WANT_SYS_CLONE
+asmlinkage long sys_clone6(struct clone6_args __user *uargs,
+ unsigned int flags1, unsigned int flags2,
+ unsigned int flags3, unsigned int flags4,
+ unsigned int flags5);
+#endif
+
asmlinkage long sys_execve(const char __user *filename,
const char __user *const __user *argv,
const char __user *const __user *envp);
diff --git a/include/uapi/linux/sched.h b/include/uapi/linux/sched.h
index ed4ee170bee2..b8d2809c5bc6 100644
--- a/include/uapi/linux/sched.h
+++ b/include/uapi/linux/sched.h
@@ -2,6 +2,8 @@
#ifndef _UAPI_LINUX_SCHED_H
#define _UAPI_LINUX_SCHED_H

+#include <linux/types.h>
+
/*
* cloning flags:
*/
@@ -31,6 +33,18 @@
#define CLONE_NEWNET 0x40000000 /* New network namespace */
#define CLONE_IO 0x80000000 /* Clone io context */

+/*
+ * Arguments for the clone6 syscall
+ */
+struct clone6_args {
+ __s32 pidfd;
+ __aligned_u64 parent_tidptr;
+ __aligned_u64 child_tidptr;
+ __aligned_u64 stack;
+ __aligned_u64 stack_size;
+ __aligned_u64 tls;
+};
+
/*
* Scheduling policies
*/
diff --git a/kernel/fork.c b/kernel/fork.c
index b4cba953040a..ffd5471c64af 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1760,19 +1760,21 @@ static __always_inline void delayed_free_task(struct task_struct *tsk)
* flags). The actual kick-off is left to the caller.
*/
static __latent_entropy struct task_struct *copy_process(
- unsigned long clone_flags,
- unsigned long stack_start,
- unsigned long stack_size,
- int __user *parent_tidptr,
- int __user *child_tidptr,
+ u64 clone_flags,
struct pid *pid,
int trace,
- unsigned long tls,
- int node)
+ int node,
+ struct clone6_args *args,
+ bool is_clone6)
{
int pidfd = -1, retval;
struct task_struct *p;
struct multiprocess_signals delayed;
+ int __user *child_tidptr = u64_to_user_ptr(args->child_tidptr);
+ int __user *parent_tidptr = NULL;
+ unsigned long tls = args->tls;
+ unsigned long stack_start = args->stack;
+ unsigned long stack_size = args->stack_size;

/*
* Don't allow sharing the root directory with processes in a different
@@ -1824,25 +1826,32 @@ static __latent_entropy struct task_struct *copy_process(
int reserved;

/*
- * - CLONE_PARENT_SETTID is useless for pidfds and also
- * parent_tidptr is used to return pidfds.
* - CLONE_DETACHED is blocked so that we can potentially
* reuse it later for CLONE_PIDFD.
* - CLONE_THREAD is blocked until someone really needs it.
*/
- if (clone_flags &
- (CLONE_DETACHED | CLONE_PARENT_SETTID | CLONE_THREAD))
+ if (clone_flags & (CLONE_DETACHED | CLONE_THREAD))
return ERR_PTR(-EINVAL);

- /*
- * Verify that parent_tidptr is sane so we can potentially
- * reuse it later.
- */
- if (get_user(reserved, parent_tidptr))
- return ERR_PTR(-EFAULT);
+ if (!is_clone6) {
+ /*
+ * For non-clone6() versions parent_tidptr is used to
+ * return pidfds.
+ */
+ if (clone_flags & CLONE_PARENT_SETTID)
+ return ERR_PTR(-EINVAL);

- if (reserved != 0)
- return ERR_PTR(-EINVAL);
+ /*
+ * Verify that parent_tidptr is sane so we can
+ * potentially reuse it later.
+ */
+ parent_tidptr = u64_to_user_ptr(args->parent_tidptr);
+ if (get_user(reserved, parent_tidptr))
+ return ERR_PTR(-EFAULT);
+
+ if (reserved != 0)
+ return ERR_PTR(-EINVAL);
+ }
}

/*
@@ -2062,9 +2071,14 @@ static __latent_entropy struct task_struct *copy_process(
goto bad_fork_free_pid;

pidfd = retval;
- retval = put_user(pidfd, parent_tidptr);
- if (retval)
- goto bad_fork_put_pidfd;
+ if (!is_clone6) {
+ /* store pidfd in parent_tidptr for legacy clone */
+ retval = put_user(pidfd, parent_tidptr);
+ if (retval)
+ goto bad_fork_put_pidfd;
+ } else {
+ args->pidfd = pidfd;
+ }
}

#ifdef CONFIG_BLOCK
@@ -2313,8 +2327,10 @@ static inline void init_idle_pids(struct task_struct *idle)
struct task_struct *fork_idle(int cpu)
{
struct task_struct *task;
- task = copy_process(CLONE_VM, 0, 0, NULL, NULL, &init_struct_pid, 0, 0,
- cpu_to_node(cpu));
+ struct clone6_args args = { 0 };
+
+ task = copy_process(CLONE_VM, &init_struct_pid, 0,
+ cpu_to_node(cpu), &args, false);
if (!IS_ERR(task)) {
init_idle_pids(task);
init_idle(task, cpu);
@@ -2334,18 +2350,15 @@ struct mm_struct *copy_init_mm(void)
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*/
-long _do_fork(unsigned long clone_flags,
- unsigned long stack_start,
- unsigned long stack_size,
- int __user *parent_tidptr,
- int __user *child_tidptr,
- unsigned long tls)
+static long _do_clone_common(u64 clone_flags, struct clone6_args *args,
+ bool is_clone6)
{
struct completion vfork;
struct pid *pid;
struct task_struct *p;
int trace = 0;
long nr;
+ int __user *parent_tidptr = u64_to_user_ptr(args->parent_tidptr);

/*
* Determine whether and which event to report to ptracer. When
@@ -2365,8 +2378,8 @@ long _do_fork(unsigned long clone_flags,
trace = 0;
}

- p = copy_process(clone_flags, stack_start, stack_size, parent_tidptr,
- child_tidptr, NULL, trace, tls, NUMA_NO_NODE);
+ p = copy_process(clone_flags, NULL, trace, NUMA_NO_NODE, args,
+ is_clone6);
add_latent_entropy();

if (IS_ERR(p))
@@ -2405,6 +2418,11 @@ long _do_fork(unsigned long clone_flags,
return nr;
}

+long _do_fork(u64 clone_flags, struct clone6_args *args)
+{
+ return _do_clone_common(clone_flags, args, false);
+}
+
#ifndef CONFIG_HAVE_COPY_THREAD_TLS
/* For compatibility with architectures that call do_fork directly rather than
* using the syscall entry points below. */
@@ -2414,8 +2432,14 @@ long do_fork(unsigned long clone_flags,
int __user *parent_tidptr,
int __user *child_tidptr)
{
- return _do_fork(clone_flags, stack_start, stack_size,
- parent_tidptr, child_tidptr, 0);
+ struct clone6_args args = {
+ .stack = stack_start,
+ .stack_size = stack_size,
+ .parent_tidptr = (uintptr_t)parent_tidptr,
+ .child_tidptr = (uintptr_t)child_tidptr
+ };
+
+ return _do_fork(clone_flags, &args);
}
#endif

@@ -2424,15 +2448,20 @@ long do_fork(unsigned long clone_flags,
*/
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
- return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, (unsigned long)fn,
- (unsigned long)arg, NULL, NULL, 0);
+ struct clone6_args args = {
+ .stack = (unsigned long)fn,
+ .stack_size = (unsigned long)arg,
+ };
+ return _do_fork(flags|CLONE_VM|CLONE_UNTRACED, &args);
}

#ifdef __ARCH_WANT_SYS_FORK
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
- return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
+ struct clone6_args args = { 0 };
+
+ return _do_fork(SIGCHLD, &args);
#else
/* can not support in nommu mode */
return -EINVAL;
@@ -2443,8 +2472,9 @@ SYSCALL_DEFINE0(fork)
#ifdef __ARCH_WANT_SYS_VFORK
SYSCALL_DEFINE0(vfork)
{
- return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
- 0, NULL, NULL, 0);
+ struct clone6_args args = { 0 };
+
+ return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, &args);
}
#endif

@@ -2472,7 +2502,58 @@ SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
unsigned long, tls)
#endif
{
- return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
+ struct clone6_args args = {
+ .stack = newsp,
+ .parent_tidptr = (uintptr_t)parent_tidptr,
+ .tls = tls,
+ .child_tidptr = (uintptr_t)child_tidptr
+ };
+ return _do_fork(clone_flags, &args);
+}
+
+SYSCALL_DEFINE6(clone6, struct clone6_args __user*, uargs,
+ unsigned int, flags1,
+ unsigned int, flags2,
+ unsigned int, flags3,
+ unsigned int, flags4,
+ unsigned int, flags5)
+{
+ struct clone6_args args = { 0 };
+ u64 flags = flags1;
+ int result;
+
+ if (flags2 || flags3 || flags4 || flags5)
+ return -EINVAL;
+
+ /*
+ * flags1 is full so we only need to verify that CLONE_DETACHED
+ * is not passed since we can't use it.
+ */
+ if (flags & CLONE_DETACHED)
+ return -EINVAL;
+
+ result = get_user(args.stack, &uargs->stack);
+#if defined(CONFIG_CLONE_BACKWARDS3)
+ if (!result)
+ result = get_user(args.stack_size, &uargs->stack_size);
+#endif
+ if (!result && (flags & (CLONE_CHILD_CLEARTID | CLONE_CHILD_SETTID)))
+ result = get_user(args.child_tidptr, &uargs->child_tidptr);
+ if (!result && (flags & CLONE_PARENT_SETTID))
+ result = get_user(args.parent_tidptr, &uargs->parent_tidptr);
+ if (!result && (flags & CLONE_SETTLS))
+ result = get_user(args.tls, &uargs->tls);
+ if (result)
+ return -EFAULT;
+
+ result = _do_clone_common(flags, &args, true);
+
+ if (flags & CLONE_PIDFD) {
+ if (put_user(args.pidfd, &uargs->pidfd))
+ return -EFAULT;
+ }
+
+ return result;
}
#endif

--
2.21.0