Re: [PATCH v5 3/6] pidfd: add CLONE_PIDFD_AUTOKILL
From: Jann Horn
Date: Mon Mar 02 2026 - 12:17:17 EST
On Thu, Feb 26, 2026 at 2:51 PM Christian Brauner <brauner@xxxxxxxxxx> wrote:
> Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> lifetime to the pidfd returned from clone3(). When the last reference to
> the struct file created by clone3() is closed, the kernel sends SIGKILL
> to the child. A pidfd obtained via pidfd_open() for the same process
> does not keep the child alive and does not trigger autokill - only the
> specific struct file from clone3() has this property.
>
> This is useful for container runtimes, service managers, and sandboxed
> subprocess execution - any scenario where the child must die if the
> parent crashes or abandons the pidfd.
>
> CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD (the whole point is tying
> lifetime to the pidfd file) and CLONE_AUTOREAP (a killed child with no
> one to reap it would become a zombie). CLONE_THREAD is rejected because
> autokill targets a process, not a thread.
>
> The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
> the struct file at clone3() time. The pidfs .release handler checks this
> flag and sends SIGKILL via do_send_sig_info(SIGKILL, SEND_SIG_PRIV, ...)
> only when it is set. Files from pidfd_open() or open_by_handle_at() are
> distinct struct files that do not carry this flag. dup()/fork() share the
> same struct file so they extend the child's lifetime until the last
> reference drops.
>
> CLONE_PIDFD_AUTOKILL uses a privilege model based on CLONE_NNP: without
> CLONE_NNP the child could escalate privileges via setuid/setgid exec
> after being spawned, so the caller must have CAP_SYS_ADMIN in its user
> namespace. With CLONE_NNP the child can never gain new privileges so
> unprivileged usage is allowed. This is a deliberate departure from the
> pdeath_signal model, which is reset during secureexec and commit_creds(),
> rendering it useless for container runtimes that need to deprivilege
> themselves.
[...]
> diff --git a/kernel/fork.c b/kernel/fork.c
> index a3202ee278d8..0f4944ce378d 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2042,6 +2042,24 @@ __latent_entropy struct task_struct *copy_process(
>  		return ERR_PTR(-EINVAL);
>  	}
>
> +	if (clone_flags & CLONE_PIDFD_AUTOKILL) {
> +		if (!(clone_flags & CLONE_PIDFD))
> +			return ERR_PTR(-EINVAL);
> +		if (!(clone_flags & CLONE_AUTOREAP))
> +			return ERR_PTR(-EINVAL);
> +		if (clone_flags & CLONE_THREAD)
> +			return ERR_PTR(-EINVAL);
> +		/*
> +		 * Without CLONE_NNP the child could escalate privileges
> +		 * after being spawned, so require CAP_SYS_ADMIN.
> +		 * With CLONE_NNP the child can't gain new privileges,
> +		 * so allow unprivileged usage.
> +		 */
> +		if (!(clone_flags & CLONE_NNP) &&
> +		    !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> +			return ERR_PTR(-EPERM);
> +	}
That security model looks good to me.
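
For anyone following along, the flag checks in the hunk above can be
modeled in userspace roughly like this. It is only a sketch: the SK_*
bit values are placeholders invented for illustration (they are not the
UAPI values from this series), sketch_check_autokill() is a hypothetical
name, and the has_cap_sys_admin parameter stands in for the
ns_capable(current_user_ns(), CAP_SYS_ADMIN) call.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bits for illustration only; NOT the kernel's UAPI values. */
#define SK_CLONE_THREAD		(UINT64_C(1) << 0)
#define SK_CLONE_PIDFD		(UINT64_C(1) << 1)
#define SK_CLONE_AUTOREAP	(UINT64_C(1) << 2)
#define SK_CLONE_NNP		(UINT64_C(1) << 3)
#define SK_CLONE_PIDFD_AUTOKILL	(UINT64_C(1) << 4)

/*
 * Hypothetical userspace model of the quoted checks.
 * Returns 0 if the flag combination would be accepted, or a negative
 * errno in the same order the hunk tests them.
 * has_cap_sys_admin is an assumption standing in for
 * ns_capable(current_user_ns(), CAP_SYS_ADMIN).
 */
int sketch_check_autokill(uint64_t flags, bool has_cap_sys_admin)
{
	/* Nothing to validate unless autokill was requested. */
	if (!(flags & SK_CLONE_PIDFD_AUTOKILL))
		return 0;

	/* Autokill is tied to the pidfd file, so a pidfd is mandatory. */
	if (!(flags & SK_CLONE_PIDFD))
		return -EINVAL;

	/* A killed child with nobody to reap it would linger as a zombie. */
	if (!(flags & SK_CLONE_AUTOREAP))
		return -EINVAL;

	/* Autokill targets a process, not a thread. */
	if (flags & SK_CLONE_THREAD)
		return -EINVAL;

	/* Without NNP the child could gain privileges, so require the cap. */
	if (!(flags & SK_CLONE_NNP) && !has_cap_sys_admin)
		return -EPERM;

	return 0;
}
```

Under that model, an unprivileged caller passing PIDFD | AUTOREAP |
PIDFD_AUTOKILL without NNP gets -EPERM, and adding NNP makes the same
call succeed.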