[PATCH v4 0/4] pidfd: add CLONE_AUTOREAP and CLONE_PIDFD_AUTOKILL

From: Christian Brauner

Date: Mon Feb 23 2026 - 05:46:00 EST


Add two new clone3() flags for pidfd-based process lifecycle management.

CLONE_AUTOREAP makes a child process auto-reap on exit without ever
becoming a zombie. This is a per-process property in contrast to the
existing auto-reap mechanism via SA_NOCLDWAIT or SIG_IGN for SIGCHLD
which applies to all children of a given parent.

Currently the only way to automatically reap children is to set
SA_NOCLDWAIT or SIG_IGN on SIGCHLD. This is a parent-scoped property
affecting all children which makes it unsuitable for libraries or
applications that need selective auto-reaping of specific children while
still being able to wait() on others.

CLONE_AUTOREAP stores an autoreap flag in the child's signal_struct.
When the child exits do_notify_parent() checks this flag and causes
exit_notify() to transition the task directly to EXIT_DEAD. Since the
flag lives on the child it survives reparenting: if the original parent
exits and the child is reparented to a subreaper or init the child still
auto-reaps when it eventually exits. This is cleaner than forcing the
subreaper to get SIGCHLD and then reaping it. If the parent doesn't care
the subreaper won't care. If there's a subreaper that would care it
would be easy enough to add a prctl() that either just turns SIGCHLD
back on and turns off auto-reaping or a prctl() that just notifies the
subreaper whenever a child is reparented to it.

CLONE_AUTOREAP can be combined with CLONE_PIDFD to allow the parent to
monitor the child's exit via poll() and retrieve exit status via
PIDFD_GET_INFO. Without CLONE_PIDFD it provides a fire-and-forget
pattern. No exit signal is delivered so exit_signal must be zero.
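
The poll() side of this pattern can be sketched with APIs that already
exist. Since CLONE_AUTOREAP is not available outside this series, the
example below uses a plain fork() child and pidfd_open() as a stand-in for
a pidfd returned by clone3(). Retrieving the exit status via PIDFD_GET_INFO
is not shown; because a fork() child is not auto-reaped, the sketch falls
back to waitpid() for that. The helper name wait_for_exit_via_pidfd() and
its return convention are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434	/* for older libc headers */
#endif

/*
 * Hypothetical helper for illustration. Returns the child's exit code (7)
 * once the pidfd has signalled the exit, -2 if pidfd_open() is unavailable
 * in this environment, or -1 on any other failure.
 */
int wait_for_exit_via_pidfd(void)
{
	struct pollfd pfd = { .events = POLLIN };
	int status = 0, ready;
	pid_t pid;

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0)
		_exit(7);

	pfd.fd = (int)syscall(SYS_pidfd_open, pid, 0);
	if (pfd.fd < 0) {
		waitpid(pid, NULL, 0);
		return -2;
	}

	/* A pidfd becomes readable once the process has terminated. */
	do {
		ready = poll(&pfd, 1, 5000);
	} while (ready < 0 && errno == EINTR);

	/* Stand-in for PIDFD_GET_INFO: a fork() child must still be reaped. */
	if (waitpid(pid, &status, 0) < 0)
		status = -1;
	close(pfd.fd);

	if (ready != 1 || !(pfd.revents & POLLIN))
		return -1;
	return (status >= 0 && WIFEXITED(status)) ? WEXITSTATUS(status) : -1;
}
```

With CLONE_AUTOREAP | CLONE_PIDFD the waitpid() call would presumably go
away and the status would come from the pidfd instead.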

The flag is not inherited by the autoreap process's own children. Each
child that should be autoreaped must be explicitly created with
CLONE_AUTOREAP.

CLONE_PIDFD_AUTOKILL ties a child's lifetime to the pidfd returned from
clone3(). When the last reference to the struct file created by clone3()
is closed the kernel sends SIGKILL to the child. A pidfd obtained via
pidfd_open() for the same process does not keep the child alive and does
not trigger autokill - only the specific struct file from clone3() has
this property. This is useful for container runtimes, service managers,
and sandboxed subprocess execution - any scenario where the child must
die if the parent crashes or abandons the pidfd, or where the parent
just wants a throwaway helper process.

CLONE_PIDFD_AUTOKILL requires both CLONE_PIDFD and CLONE_AUTOREAP. It
requires CLONE_PIDFD because the whole point is tying the child's
lifetime to the pidfd. It requires CLONE_AUTOREAP because a killed child
with no one to reap it would become a zombie - the primary use case is
the parent crashing or abandoning the pidfd so no one is around to call
waitpid().
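
The flag combination rules stated in this cover letter (including the
CLONE_PARENT rejection listed in the v4 changelog) can be summarised as a
small userspace sketch. This is not the code in kernel/fork.c. The
SKETCH_* bit values are placeholders and not the UAPI values, the function
name is hypothetical, and returning -EINVAL is an assumption about how the
kernel reports invalid combinations.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* PLACEHOLDER bits for illustration only, NOT the real UAPI values. */
#define SKETCH_CLONE_PIDFD		(1ULL << 0)
#define SKETCH_CLONE_PARENT		(1ULL << 1)
#define SKETCH_CLONE_AUTOREAP		(1ULL << 2)
#define SKETCH_CLONE_PIDFD_AUTOKILL	(1ULL << 3)

/*
 * Hypothetical checker mirroring the rules described in the text.
 * Returns 0 if the combination is allowed, -EINVAL (assumed) otherwise.
 */
int sketch_check_flags(uint64_t flags, int exit_signal)
{
	if (flags & SKETCH_CLONE_PIDFD_AUTOKILL) {
		/* The child's lifetime is tied to the pidfd. */
		if (!(flags & SKETCH_CLONE_PIDFD))
			return -EINVAL;
		/* A killed child with no one to reap it would be a zombie. */
		if (!(flags & SKETCH_CLONE_AUTOREAP))
			return -EINVAL;
	}

	if (flags & SKETCH_CLONE_AUTOREAP) {
		/* No exit signal is delivered, so none may be requested. */
		if (exit_signal != 0)
			return -EINVAL;
		/* Rejected in v4 to avoid silent zombies. */
		if (flags & SKETCH_CLONE_PARENT)
			return -EINVAL;
	}

	return 0;
}
```

For example, sketch_check_flags(SKETCH_CLONE_AUTOREAP, 0) is accepted as
the fire-and-forget case, while SKETCH_CLONE_PIDFD_AUTOKILL combined only
with SKETCH_CLONE_PIDFD is rejected.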

CLONE_PIDFD_AUTOKILL automatically sets no_new_privs on the child
process. This ensures the child cannot escalate privileges beyond the
parent's credential level via setuid/setgid exec. Because the child can
never gain more privileges than the parent the autokill SIGKILL is always
within the parent's authority. This avoids the pdeath_signal trap where
the kernel resets the property during secureexec and commit_creds()
making it useless for container runtimes and service managers that
deprivilege themselves. The no_new_privs restriction only affects the
child. The parent retains full privileges.
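
The child-only scope of no_new_privs can be illustrated from userspace.
In this series the kernel sets the bit at clone3() time; the sketch below
uses the existing prctl(PR_SET_NO_NEW_PRIVS) interface in a fork() child
purely as a stand-in. The helper name child_only_no_new_privs() is made up
for this example.

```c
#include <assert.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration. Returns 1 if the child ended up
 * with no_new_privs set while the parent's own setting is unchanged,
 * 0 if the parent's setting changed, or -1 on failure.
 */
int child_only_no_new_privs(void)
{
	int before, after, status;
	pid_t pid;

	/* The parent may already run with no_new_privs, so record it. */
	before = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);

	pid = fork();
	if (pid < 0)
		return -1;
	if (pid == 0) {
		/* Stand-in for what CLONE_PIDFD_AUTOKILL would do in-kernel. */
		if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0)
			_exit(2);
		_exit(prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0) == 1 ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* The bit was set in the child only; the parent is unaffected. */
	after = prctl(PR_GET_NO_NEW_PRIVS, 0, 0, 0, 0);
	return before == after;
}
```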

The clone3 pidfd is identified by the PIDFD_AUTOKILL file flag set on
the struct file at clone3() time. The pidfs .release handler checks this
flag and sends SIGKILL only when it is set. dup()/fork() share the same
struct file so they extend the child's lifetime until the last reference
drops.
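
The distinction between sharing a struct file and opening a new one is the
same one that applies to any file descriptor. As a rough analogy, the
sketch below uses a regular temporary file instead of a pidfd: dup() shares
the struct file (and therefore the file offset), while a separate open()
creates an independent one, much like pidfd_open() creates a struct file
without the PIDFD_AUTOKILL flag. The helper name and the /tmp path template
are assumptions made for this example.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper for illustration. Returns 1 if a dup()ed descriptor
 * observes the offset change made through the original descriptor while a
 * separately open()ed descriptor does not, 0 otherwise, -1 on failure.
 */
int dup_shares_struct_file(void)
{
	char path[] = "/tmp/struct-file-sketch-XXXXXX";
	off_t dup_off = -1, other_off = -1;
	int fd, dupfd, other;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;

	dupfd = dup(fd);		/* same struct file as fd */
	other = open(path, O_RDONLY);	/* new, independent struct file */
	unlink(path);

	/* Advance the offset through fd only. */
	if (dupfd >= 0 && other >= 0 && write(fd, "abcd", 4) == 4) {
		dup_off = lseek(dupfd, 0, SEEK_CUR);
		other_off = lseek(other, 0, SEEK_CUR);
	}

	if (dupfd >= 0)
		close(dupfd);
	if (other >= 0)
		close(other);
	close(fd);

	if (dup_off < 0 || other_off < 0)
		return -1;
	return dup_off == 4 && other_off == 0;
}
```

In the same way, closing one of several dup()ed descriptors does not drop
the last reference to the struct file, so the .release handler does not run
until all of them are closed.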

Signed-off-by: Christian Brauner <brauner@xxxxxxxxxx>
---
Changes in v4:
- Set no_new_privs on child when CLONE_PIDFD_AUTOKILL is used. This
  prevents the child from escalating privileges via setuid/setgid exec
  and eliminates the need for magical resets during credential changes.
  The parent retains full privileges.
- Replace autokill_pidfd pointer with PIDFD_AUTOKILL file flag checked
  in pidfs_file_release(). This eliminates the need for pointer
  comparison, stale pointer concerns, and WRITE_ONCE/READ_ONCE pairing
  (Oleg, Jann).
- Reject CLONE_AUTOREAP | CLONE_PARENT to prevent a CLONE_AUTOREAP
  child from creating silent zombies via clone(CLONE_PARENT) (Oleg).
- Link to v3: https://patch.msgid.link/20260217-work-pidfs-autoreap-v3-0-33a403c20111@xxxxxxxxxx

Changes in v2:
- Add CLONE_PIDFD_AUTOKILL flag
- Decouple CLONE_AUTOREAP from CLONE_PIDFD: the autoreap mechanism has
  no dependency on pidfds. This allows fire-and-forget patterns where
  the parent does not need exit status.
- Link to v1: https://patch.msgid.link/20260216-work-pidfs-autoreap-v1-0-e63f663008f2@xxxxxxxxxx

---
Christian Brauner (4):
      clone: add CLONE_AUTOREAP
      pidfd: add CLONE_PIDFD_AUTOKILL
      selftests/pidfd: add CLONE_AUTOREAP tests
      selftests/pidfd: add CLONE_PIDFD_AUTOKILL tests

 fs/pidfs.c                                        |  38 +-
 include/linux/sched/signal.h                      |   1 +
 include/uapi/linux/pidfd.h                        |   1 +
 include/uapi/linux/sched.h                        |   2 +
 kernel/fork.c                                     |  34 +-
 kernel/ptrace.c                                   |   3 +-
 kernel/signal.c                                   |   4 +
 tools/testing/selftests/pidfd/.gitignore          |   1 +
 tools/testing/selftests/pidfd/Makefile            |   2 +-
 .../testing/selftests/pidfd/pidfd_autoreap_test.c | 793 +++++++++++++++++++++
 10 files changed, 868 insertions(+), 11 deletions(-)
---
base-commit: 6de23f81a5e08be8fbf5e8d7e9febc72a5b5f27f
change-id: 20260214-work-pidfs-autoreap-3ee677e240a8