Re: [PATCH v7 1/2] fork: extend clone3() to support setting a PID

From: Rasmus Villemoes
Date: Mon Nov 11 2019 - 15:41:45 EST


On 11/11/2019 14.17, Adrian Reber wrote:
> The main motivation to add set_tid to clone3() is CRIU.
>
> To restore a process with the same PID/TID CRIU currently uses
> /proc/sys/kernel/ns_last_pid. It writes the desired (PID - 1) to
> ns_last_pid and then (quickly) does a clone(). This works most of the
> time, but it is racy. It is also slow as it requires multiple syscalls.
>
> Extending clone3() to support *set_tid makes it possible to restore a
> process using CRIU without accessing /proc/sys/kernel/ns_last_pid and
> race free (as long as the desired PID/TID is available).
>
> This clone3() extension places the same restrictions (CAP_SYS_ADMIN)
> on clone3() with *set_tid as they are currently in place for ns_last_pid.
>
> The original version of this change was using a single value for
> set_tid. At the 2019 LPC, after presenting set_tid, it was, however,
> decided to change set_tid to an array to enable setting the PID of a
> process in multiple PID namespaces at the same time. If a process is
> created in a PID namespace it is possible to influence the PID inside
> and outside of the PID namespace. Details also in the corresponding
> selftest.
>

> /*
> * Verify that higher 32bits of exit_signal are unset and that
> * it is a valid signal
> @@ -2556,8 +2561,17 @@ noinline static int copy_clone_args_from_user(struct kernel_clone_args *kargs,
> .stack = args.stack,
> .stack_size = args.stack_size,
> .tls = args.tls,
> + .set_tid = kargs->set_tid,
> + .set_tid_size = args.set_tid_size,
> };

This is a bit ugly. And is it even well-defined? The initializer that is
assigned to *kargs reads kargs->set_tid, which is a bit similar to
"i = i++;". So it would be best to avoid it.
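
To illustrate one way of avoiding it: as far as I can tell the compound
literal is fully evaluated before *kargs is written, so the construct is
probably fine, but reading the old pointer into a local first means nobody
has to think about it. A userspace sketch, where struct demo_args and
demo_fill() are made-up stand-ins and not the kernel's names:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct kernel_clone_args. */
struct demo_args {
	int *set_tid;
	size_t set_tid_size;
	unsigned long flags;
};

/*
 * Reinitialize *kargs while keeping the set_tid pointer the caller
 * installed. The old pointer is read into a local before *kargs is
 * overwritten, so the initializer never refers to the object that is
 * being assigned.
 */
static void demo_fill(struct demo_args *kargs, unsigned long flags, size_t n)
{
	int *set_tid;

	assert(kargs != NULL);
	set_tid = kargs->set_tid;

	*kargs = (struct demo_args){
		.set_tid = set_tid,
		.set_tid_size = n,
		.flags = flags,
	};
}
```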

> + for (i = 0; i < args.set_tid_size; i++) {
> + if (copy_from_user(&kargs->set_tid[i],
> + u64_to_user_ptr(args.set_tid + (i * sizeof(args.set_tid))),
> + sizeof(pid_t)))
> + return -EFAULT;
> + }
> +

If I'm reading this (and your test case) right, you expect the user
pointer to point at an array of u64, and here you're copying the first
half of each u64 to the pid_t array. That only works on little-endian.

It seems more obvious (since I don't think there's any disagreement
anywhere on sizeof(pid_t)) to expect the user pointer to point at an
array of pid_t and then simply copy_from_user() the whole thing in one go.
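
A userspace sketch of the difference, with memcpy() standing in for
copy_from_user(). The demo_* names are made up for illustration, and the
sketch assumes a 32-bit pid_t:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>

/* Assumption of this sketch: pid_t is 32 bits wide. */
static_assert(sizeof(pid_t) == sizeof(int32_t), "sketch assumes 32-bit pid_t");

/*
 * Mimics the loop in the patch: step through an array of u64 and keep
 * only the first sizeof(pid_t) bytes of each element.
 */
static void demo_copy_low_halves(pid_t *dst, const uint64_t *src, size_t n)
{
	const unsigned char *p = (const unsigned char *)src;

	for (size_t i = 0; i < n; i++)
		memcpy(&dst[i], p + i * sizeof(uint64_t), sizeof(pid_t));
}

/* Suggested layout: userspace passes pid_t[], copied in one go. */
static void demo_copy_pids(pid_t *dst, const pid_t *src, size_t n)
{
	memcpy(dst, src, n * sizeof(pid_t));
}

/* Returns 1 on a little-endian host, 0 otherwise. */
static int demo_is_little_endian(void)
{
	const uint32_t one = 1;
	unsigned char first;

	memcpy(&first, &one, 1);
	return first == 1;
}
```

On a big-endian host the first four bytes of a small u64 value are all
zero, so demo_copy_low_halves() would hand back 0 for every PID, while
demo_copy_pids() behaves the same everywhere.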

> return 0;
> }
>
> @@ -2631,6 +2645,10 @@ SYSCALL_DEFINE2(clone3, struct clone_args __user *, uargs, size_t, size)
> int err;
>
> struct kernel_clone_args kargs;
> + pid_t set_tid[MAX_PID_NS_LEVEL];
> +
> + memset(set_tid, 0, sizeof(set_tid));
> + kargs.set_tid = set_tid;

Hm, isn't it a bit much to add two cachelines (and dirtying them via the
memset) to the stack footprint of clone3, considering that almost nobody
(relatively speaking) will use this?
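
For reference, the arithmetic behind "two cachelines", as a small
userspace sketch. MAX_PID_NS_LEVEL is 32 as far as I can tell; the
64-byte line size is an assumption and depends on the architecture, and
the DEMO_* names are made up:

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define DEMO_MAX_PID_NS_LEVEL	32	/* mirrors MAX_PID_NS_LEVEL */
#define DEMO_CACHELINE_BYTES	64	/* assumed, arch dependent */

/* Assumption of this sketch: pid_t is 4 bytes, so the array is 128 bytes. */
static_assert(sizeof(pid_t) * DEMO_MAX_PID_NS_LEVEL == 128,
	      "sketch assumes 4-byte pid_t");

/* Number of cachelines covered by pid_t set_tid[MAX_PID_NS_LEVEL]. */
static size_t demo_set_tid_cachelines(void)
{
	size_t bytes = sizeof(pid_t) * DEMO_MAX_PID_NS_LEVEL;

	return (bytes + DEMO_CACHELINE_BYTES - 1) / DEMO_CACHELINE_BYTES;
}
```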

So how about copy_clone_args_from_user() does

	if (args.set_tid) {
		set_tid = memdup_user(u64_to_user_ptr(), ...)
		if (IS_ERR(set_tid))
			return PTR_ERR(set_tid);
		kargs->set_tid = set_tid;
	}

Then somebody needs to free that, but this is probably not the last
clone extension that might need extra data, so one could do

s/long _do_fork/static long __do_fork/

and then create a _do_fork that always cleans up the passed-in kargs, i.e.

long _do_fork(struct kernel_clone_args *args)
{
	long ret = __do_fork(args);
	kfree(args->set_tid);
	return ret;
}
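
The ownership idea can be tried out in userspace. This is only a sketch
under assumptions: malloc()/free() stand in for memdup_user()/kfree(),
and struct demo_kargs and the demo_* functions are made-up names, not
anything in the tree:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical stand-in for struct kernel_clone_args. */
struct demo_kargs {
	pid_t *set_tid;		/* NULL when no specific PIDs were requested */
	size_t set_tid_size;
};

/* Userspace analogue of memdup_user(): duplicate n PIDs, NULL on failure. */
static pid_t *demo_memdup_pids(const pid_t *src, size_t n)
{
	pid_t *copy = malloc(n * sizeof(*copy));

	if (copy)
		memcpy(copy, src, n * sizeof(*copy));
	return copy;
}

/* Placeholder for the real work: just reports the first requested PID. */
static long demo_do_fork_inner(const struct demo_kargs *args)
{
	assert(args->set_tid_size == 0 || args->set_tid != NULL);
	return args->set_tid_size ? (long)args->set_tid[0] : 0;
}

/*
 * Wrapper that owns the cleanup on every path. free(NULL) is a no-op,
 * just like kfree(NULL), so callers that never set set_tid pay nothing.
 */
static long demo_do_fork(struct demo_kargs *args)
{
	long ret = demo_do_fork_inner(args);

	free(args->set_tid);
	args->set_tid = NULL;
	return ret;
}
```

The point of doing the free in the wrapper is that the inner function
can keep its early returns without every error path having to remember
the extra allocation.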

Rasmus