Re: [PATCH v3 1/2] fork: add clone3

From: Serge E. Hallyn
Date: Thu Jun 06 2019 - 17:51:01 EST


On Tue, Jun 04, 2019 at 06:09:43PM +0200, Christian Brauner wrote:
> This adds the clone3 system call.
>
> As mentioned several times already (cf. [7], [8]) here's the promised
> patchset for clone3().
>
> We recently merged the CLONE_PIDFD patchset (cf. [1]). It took the last
> free flag from clone().
>
> Independent of the CLONE_PIDFD patchset a time namespace has been discussed
> at Linux Plumber Conference last year and has been sent out and reviewed
> (cf. [5]). It is expected that it will go upstream in the not too distant
> future. However, it relies on the addition of the CLONE_NEWTIME flag to
> clone(). The only other good candidate - CLONE_DETACHED - is currently not
> recyclable as we have identified at least two large or widely used
> codebases that currently pass this flag (cf. [2], [3], and [4]). Given that
> CLONE_PIDFD grabbed the last clone() flag the time namespace is effectively
> blocked. clone3() has the advantage that it will unblock this patchset
> again. In general, clone3() is extensible and allows for the implementation
> of new features.
>
> The idea is to keep clone3() very simple and close to the original clone(),
> specifically, to keep on supporting old clone()-based workloads.
> We know there have been various creative proposals how a new process
> creation syscall or even api is supposed to look like. Some people even
> going so far as to argue that the traditional fork()+exec() split should be
> abandoned in favor of an in-kernel version of spawn(). Independent of
> whether or not we personally think spawn() is a good idea this patchset has
> and does not want to have anything to do with this.
> One stance we take is that there's no real good alternative to
> clone()+exec() and we need and want to support this model going forward;
> independent of spawn().
> The following requirements guided clone3():
> - bump the number of available flags
> - move arguments that are currently passed as separate arguments
> in clone() into a dedicated struct clone_args
> - choose a struct layout that is easy to handle on 32 and on 64 bit
> - choose a struct layout that is extensible
> - give new flags that currently need to abuse another flag's dedicated
> return argument in clone() their own dedicated return argument
> (e.g. CLONE_PIDFD)
> - use a separate kernel internal struct kernel_clone_args that is
> properly typed according to current kernel conventions in fork.c and is
> different from the uapi struct clone_args
> - port _do_fork() to use kernel_clone_args so that all process creation
> syscalls such as fork(), vfork(), clone(), and clone3() behave identical
> (Arnd suggested, that we can probably also port do_fork() itself in a
> separate patchset.)
> - ease of transition for userspace from clone() to clone3()
> This very much means that we do *not* remove functionality that userspace
> currently relies on as the latter is a good way of creating a syscall
> that won't be adopted.
> - do not try to be clever or complex: keep clone3() as dumb as possible
>
> In accordance with Linus suggestions (cf. [11]), clone3() has the following
> signature:
>
> /* uapi */
> struct clone_args {
> __aligned_u64 flags;
> __aligned_u64 pidfd;
> __aligned_u64 child_tid;
> __aligned_u64 parent_tid;
> __aligned_u64 exit_signal;
> __aligned_u64 stack;
> __aligned_u64 stack_size;
> __aligned_u64 tls;
> };
>
> /* kernel internal */
> struct kernel_clone_args {
> u64 flags;
> int __user *pidfd;
> int __user *child_tid;
> int __user *parent_tid;
> int exit_signal;
> unsigned long stack;
> unsigned long stack_size;
> unsigned long tls;
> };
>
> long sys_clone3(struct clone_args __user *uargs, size_t size)
>
> clone3() cleanly supports all of the supported flags from clone() and thus
> all legacy workloads.
> The advantage of sticking close to the old clone() is the low cost for
> userspace to switch to this new api. Quite a lot of userspace apis (e.g.
> pthreads) are based on the clone() syscall. With the new clone3() syscall
> supporting all of the old workloads and opening up the ability to add new
> features should make switching to it for userspace more appealing. In
> essence, glibc can just write a simple wrapper to switch from clone() to
> clone3().
>
> There has been some interest in this patchset already. We have received a
> patch from the CRIU corner for clone3() that would set the PID/TID of a
> restored process without /proc/sys/kernel/ns_last_pid to eliminate a race.
>
> /* User visible differences to legacy clone() */
> - CLONE_DETACHED will cause EINVAL with clone3()
> - CSIGNAL is deprecated
> It is superseeded by a dedicated "exit_signal" argument in struct
> clone_args freeing up space for additional flags.
> This is based on a suggestion from Andrei and Linus (cf. [9] and [10])
>
> /* References */
> [1]: b3e5838252665ee4cfa76b82bdf1198dca81e5be
> [2]: https://dxr.mozilla.org/mozilla-central/source/security/sandbox/linux/SandboxFilter.cpp#343
> [3]: https://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_create.c#n233
> [4]: https://sources.debian.org/src/blcr/0.8.5-2.3/cr_module/cr_dump_self.c/?hl=740#L740
> [5]: https://lore.kernel.org/lkml/20190425161416.26600-1-dima@xxxxxxxxxx/
> [6]: https://lore.kernel.org/lkml/20190425161416.26600-2-dima@xxxxxxxxxx/
> [7]: https://lore.kernel.org/lkml/CAHrFyr5HxpGXA2YrKza-oB-GGwJCqwPfyhD-Y5wbktWZdt0sGQ@xxxxxxxxxxxxxx/
> [8]: https://lore.kernel.org/lkml/20190524102756.qjsjxukuq2f4t6bo@xxxxxxxxxx/
> [9]: https://lore.kernel.org/lkml/20190529222414.GA6492@xxxxxxxxx/
> [10]: https://lore.kernel.org/lkml/CAHk-=whQP-Ykxi=zSYaV9iXsHsENa+2fdj-zYKwyeyed63Lsfw@xxxxxxxxxxxxxx/
> [11]: https://lore.kernel.org/lkml/CAHk-=wieuV4hGwznPsX-8E0G2FKhx3NjZ9X3dTKh5zKd+iqOBw@xxxxxxxxxxxxxx/
>
> Suggested-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Christian Brauner <christian@xxxxxxxxxx>

Acked-by: Serge Hallyn <serge@xxxxxxxxxx>

> Cc: Arnd Bergmann <arnd@xxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxxxx>
> Cc: Pavel Emelyanov <xemul@xxxxxxxxxxxxx>
> Cc: Jann Horn <jannh@xxxxxxxxxx>
> Cc: David Howells <dhowells@xxxxxxxxxx>
> Cc: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: Adrian Reber <adrian@xxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Andrei Vagin <avagin@xxxxxxxxx>
> Cc: Al Viro <viro@xxxxxxxxxxxxxxxxxx>
> Cc: Florian Weimer <fweimer@xxxxxxxxxx>
> Cc: linux-api@xxxxxxxxxxxxxxx
> ---
> v1:
> - Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>:
> - redesign based on Linus proposal
> - switch from arg-based to revision-based naming scheme: s/clone6/clone3/
> - Arnd Bergmann <arnd@xxxxxxxx>:
> - use a single copy_from_user() instead of multiple get_user() calls
> since the latter have a constant overhead on some architectures
> - a range of other tweaks and suggestions
> v2:
> - Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>,
> Andrei Vagin <avagin@xxxxxxxxx>:
> - replace CSIGNAL flag with dedicated exit_signal argument in struct
> clone_args
> - Christian Brauner <christian@xxxxxxxxxx>:
> - improve naming for some struct clone_args members
> v3:
> - Arnd Bergmann <arnd@xxxxxxxx>:
> - replace memset with constructor for clarity and better object code
> - call flag verification function clone3_flags_valid() on
> kernel_clone_args instead of clone_args
> - remove __ARCH_WANT_SYS_CLONE ifdefine around sys_clone3()
> - Christian Brauner <christian@xxxxxxxxxx>:
> - replace clone3_flags_valid() with clone3_args_valid() and call in
> clone3() directly rather than in copy_clone_args_from_user()
> This cleanly separates copying the args from userspace from the
> verification whether those args are sane.
> - David Howells <dhowells@xxxxxxxxxx>:
> - align new struct member assignments with tabs
> - replace CLONE_MAX by with a non-uapi exported CLONE_LEGACY_FLAGS and
> define it as 0xffffffffULL for clarity
> - make copy_clone_args_from_user() noinline
> - avoid assigning to local variables from struct kernel_clone_args
> members in cases where it makes sense
> ---
> arch/x86/ia32/sys_ia32.c | 12 ++-
> include/linux/sched/task.h | 17 +++-
> include/linux/syscalls.h | 4 +
> include/uapi/linux/sched.h | 16 +++
> kernel/fork.c | 201 ++++++++++++++++++++++++++++---------
> 5 files changed, 199 insertions(+), 51 deletions(-)