Re: [PATCH RFC v3 2/4] pidfd: add CLONE_PIDFD_AUTOKILL

From: Christian Brauner

Date: Wed Feb 18 2026 - 05:27:01 EST

On Wed, Feb 18, 2026 at 12:38:02AM +0100, Jann Horn wrote:
> On Wed, Feb 18, 2026 at 12:18 AM Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> > On Tue, 17 Feb 2026 at 14:36, Christian Brauner <brauner@xxxxxxxxxx> wrote:
> > >
> > > Add a new clone3() flag CLONE_PIDFD_AUTOKILL that ties a child's
> > > lifetime to the pidfd returned from clone3(). When the last reference to
> > > the struct file created by clone3() is closed the kernel sends SIGKILL
> > > to the child.
> >
> > Did I read this right? You can now basically kill suid binaries that
> > you started but don't have rights to kill any other way.
> >
> > If I'm right, this is completely broken. Please explain.
>
> You can already send SIGHUP to such binaries through things like job
> control, right?
> Do we know if there are setuid binaries out there that change their
> ruid and suid to prevent being killable via kill_ok_by_cred(), then
> set SIGHUP to SIG_IGN to not be killable via job control, and then do
> some work that shouldn't be interrupted?
>
> Also, on a Linux system with systemd, I believe a normal user, when
> running in the context of a user session (but not when running in the
> context of a system service), can already SIGKILL anything they launch
> by launching it in a systemd user service, then doing something like
> "echo 1 > /sys/fs/cgroup/user.slice/user-$UID.slice/user@$UID.service/app.slice/<servicename>.scope/cgroup.kill"
> because systemd delegates cgroups for anything a user runs to that
> user; and cgroup.kill goes through the codepath
> cgroup_kill_write -> cgroup_kill -> __cgroup_kill -> send_sig(SIGKILL,
> task, 0) -> send_sig_info -> do_send_sig_info
> which, as far as I know, bypasses the normal signal sending permission
> checks. (For comparison, group_send_sig_info() first calls
> check_kill_permission(), then do_send_sig_info().)
>
> I agree that this would be a change to the security model, but I'm not
> sure if it would be that big a change. I guess an alternative might be
> to instead gate the clone() flag on a `task_no_new_privs(current) ||
> ns_capable()` check like in seccomp, but that might be too restrictive
> for the usecases Christian has in mind...

So I'm going to briefly reiterate what I wrote in my other replies because
I really don't want to give anyone the impression that I don't understand
that this is a change in the security model. It's what I explicitly
wanted to discuss:

I'm very aware that as written this will allow users to kill setuid
binaries. I explicitly wrote the first RFC so autokill isn't reset during
bprm->secureexec nor during commit_creds(), in contrast to
pdeath_signal.

I did indeed think of simply using the seccomp model. I have a long
document about all of the different implications for all of this.

Ideally we'd not have to use the seccomp model but if we have to I'm
fine with it. There are two problems I would want to avoid though. Right
now pdeath_signal is reset on _any_ set*id() transition via
commit_creds(), which makes it really useless.

For example, if you set up a container the child sets pdeath_signal so it
gets auto-killed when the container setup process dies. But as soon as
the child uses set*id() calls to become privileged over the container's
namespaces pdeath_signal magically gets reset. So all container runtimes
have this annoying code in some form:

static int do_start(void *data) /* container workload that gets set up */
{

	<snip>

	/* This prctl must be before the synchro, so if the parent dies before
	 * we set the parent death signal, we will detect its death with the
	 * synchro right after, otherwise we have a window where the parent can
	 * exit before we set the pdeath signal leading to a unsupervized
	 * container.
	 */
	ret = lxc_set_death_signal(SIGKILL, handler->monitor_pid, status_fd);
	if (ret < 0) {
		SYSERROR("Failed to set PR_SET_PDEATHSIG to SIGKILL");
		goto out_warn_father;
	}

	<snip>

	/* If we are in a new user namespace, become root there to have
	 * privilege over our namespace.
	 */
	if (!list_empty(&handler->conf->id_map)) {

		<snip>

		/* Drop groups only after we switched to a valid gid in the new
		 * user namespace.
		 */
		if (!lxc_drop_groups() &&
		    (handler->am_root || errno != EPERM))
			goto out_warn_father;

		if (!lxc_switch_uid_gid(nsuid, nsgid))
			goto out_warn_father;

		ret = prctl(PR_SET_DUMPABLE, prctl_arg(1), prctl_arg(0),
			    prctl_arg(0), prctl_arg(0));
		if (ret < 0)
			goto out_warn_father;

		/* set{g,u}id() clears deathsignal */
		ret = lxc_set_death_signal(SIGKILL, handler->monitor_pid, status_fd);
		if (ret < 0) {
			SYSERROR("Failed to set PR_SET_PDEATHSIG to SIGKILL");
			goto out_warn_father;
		}

		<snip>

I can't stress enough how useless this often makes pdeath_signal. On top
of that, the child must set it itself, so there's always a race with the
parent dying while the child is setting it. And obviously it isn't just
containers. It's anything that deprivileges itself, including some
services.
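
To make the race concrete, here is a minimal sketch of the arm-then-check
pattern that userspace typically needs around PR_SET_PDEATHSIG. The helper
names arm_pdeathsig() and get_pdeathsig() are hypothetical and are not
taken from LXC; only the prctl() options and getppid() are real
interfaces. It assumes the caller already knows the pid of the parent it
expects to be supervised by.

```c
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: arm the parent death signal, then check whether
 * the expected parent is still our parent.
 *
 * Returns  0 if the signal is armed and expected_parent is still our parent,
 *          1 if the signal is armed but we were already reparented, i.e. the
 *            parent died before the prctl() and no signal will be delivered,
 *         -1 if prctl() failed (errno is set).
 */
int arm_pdeathsig(int sig, pid_t expected_parent)
{
	if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0UL, 0UL, 0UL) < 0)
		return -1;

	/* Closes the window between fork/clone and the prctl() above. */
	if (getppid() != expected_parent)
		return 1;

	return 0;
}

/*
 * Hypothetical helper: read back the current parent death signal.
 * Returns the signal number (0 if unset) or -1 on error.
 */
int get_pdeathsig(void)
{
	int sig = 0;

	if (prctl(PR_GET_PDEATHSIG, &sig, 0UL, 0UL, 0UL) < 0)
		return -1;

	return sig;
}
```

A caller would have to repeat arm_pdeathsig() after every credential
change, and could use get_pdeathsig() after a set*id() call to observe
that the setting is back to 0.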

If we require the seccomp task_no_new_privs() thing I really really
would like to not have to reset autokill during commit_creds().

Because then it is at least consistent for task_no_new_privs() without
magic resets.
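
For reference, a minimal sketch of what the userspace side of the seccomp
model would look like, assuming the flag ends up gated the same way
unprivileged seccomp filter installation is: the caller sets no_new_privs
before making the call. The helper names are hypothetical, and the
clone3() call itself is left out because the flag only exists in this RFC.

```c
#include <sys/prctl.h>

/*
 * Hypothetical helper: set no_new_privs on the calling thread.
 * The bit cannot be cleared again and is inherited across fork, clone
 * and execve. Returns 0 on success, -1 on error (errno is set).
 */
int set_no_new_privs(void)
{
	return prctl(PR_SET_NO_NEW_PRIVS, 1UL, 0UL, 0UL, 0UL);
}

/*
 * Hypothetical helper: query no_new_privs for the calling thread.
 * Returns 1 if set, 0 if not set, -1 on error.
 */
int has_no_new_privs(void)
{
	return prctl(PR_GET_NO_NEW_PRIVS, 0UL, 0UL, 0UL, 0UL);
}
```

Since a task with no_new_privs set can't gain privileges through execve
of a setuid binary, there is nothing that would require the kernel to
reset the property later.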

TL;DR as long as we can come up with a model where there are no magical
resets of the property by the kernel this is useful.