From: Jürg Billeter
Date: Tue Oct 03 2017 - 13:01:47 EST

On Tue, 2017-10-03 at 09:46 -0500, Eric W. Biederman wrote:
> There is a general need to find out about the death of other processes,
> if you are not the parent of the process. I would be inclined to call
> it waitfd. Something that you give a pid. It performs a permission
> check and the pid becomes readable when the process dies. With poll
> working on the fd, and the fd returning wstatus of the dead child.
> Support SIGIO on the fd and you have a signal delivery mechanism,
> if you want it.

File descriptors for processes (waitfd/clonefd) are definitely
interesting. Especially if reaping the process (and reparenting its
children) is delayed until the last process file descriptor is closed.
However, this would be a much larger addition and also less intuitive
to use if all you want is to kill the process tree.
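
For illustration only: waitfd as described above was a proposal, not an
existing syscall. Later kernels (Linux 5.3 and newer) gained
pidfd_open(2), which covers the "fd becomes readable when the process
dies" part. The sketch below assumes such a kernel; the function name
pidfd_reports_exit and the fallback syscall number 434 (correct for
most, but not all, architectures) are assumptions of this example. It
watches a direct child for simplicity, although pidfd_open(2) also
accepts pids of non-children, subject to the usual permission checks.

```c
#include <poll.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434 /* assumption: generic syscall number */
#endif

/*
 * Hypothetical helper: returns 1 if the pidfd became readable after the
 * child exited, 0 on timeout, -1 if setup failed or pidfd_open(2) is
 * unavailable (old kernel, seccomp filter, ...).
 */
int pidfd_reports_exit(void)
{
    int gate[2];
    if (pipe(gate) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(gate[0]);
        close(gate[1]);
        return -1;
    }
    if (child == 0) {
        char c;
        close(gate[1]);
        /* Block until the parent closes its end of the pipe. */
        if (read(gate[0], &c, 1) < 0)
            _exit(1);
        _exit(0);
    }

    close(gate[0]);
    int pidfd = (int)syscall(SYS_pidfd_open, child, 0);
    close(gate[1]); /* release the child so that it exits */

    int result = -1;
    if (pidfd >= 0) {
        struct pollfd p = { .fd = pidfd, .events = POLLIN };
        int n = poll(&p, 1, 5000); /* wait up to 5 seconds */
        result = n < 0 ? -1 : (n > 0);
        close(pidfd);
    }
    waitpid(child, NULL, 0); /* reap the child in every case */
    return result;
}
```

Unlike the waitfd idea, a pidfd is not read to obtain wstatus; the exit
status is still collected with waitpid(2) or waitid(2).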

> For the kill all children when the parent dies the mechanism you are
> proposing is escapable. We already have an inescapable version of it
> with init in a pid namespace. We already have an escapable version of
> it with orphaned process groups and SIGHUP.
> So I would really appreciate a very clear use case for what we are
> building here. As it appears the killing of children can already be
> done another way, and that the waiting for the parent can be done better
> another way.

My use case is to provide a way for a process to spawn a child and
ensure that no descendants survive when that child dies. Avoiding
runaway processes is desirable in many situations. My motivation is
very lightweight (nested) sandboxing (every process is potentially

I.e., pid namespaces would be a pretty good fit (assuming they are
sufficiently lightweight) but CLONE_NEWPID requires CAP_SYS_ADMIN.
User namespaces can help here, but creating tons of user namespaces
just for this doesn't sound sensible. MAX_PID_NS_LEVEL could be an
issue as well at some point but 32 levels are likely fine in practice.
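
A minimal sketch of the CLONE_NEWUSER|CLONE_NEWPID combination, assuming
a kernel that allows unprivileged user namespaces (many distributions
and container runtimes disable them). The function name
child_is_pid1_in_new_pidns is made up for this example, and the raw
unshare syscall is used only to keep the example free of feature-test
macros.

```c
#include <linux/sched.h>   /* CLONE_NEWUSER, CLONE_NEWPID */
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a child created after
 * unshare(CLONE_NEWUSER | CLONE_NEWPID) sees itself as pid 1, 0 if it
 * does not, and -1 if the namespaces could not be created.
 */
int child_is_pid1_in_new_pidns(void)
{
    pid_t helper = fork();
    if (helper < 0)
        return -1;

    if (helper == 0) {
        /* Unshare in a helper so the caller's namespaces stay untouched. */
        if (syscall(SYS_unshare, CLONE_NEWUSER | CLONE_NEWPID) < 0)
            _exit(2);

        /* The first process forked after the unshare becomes pid 1. */
        pid_t child = fork();
        if (child < 0)
            _exit(2);
        if (child == 0)
            _exit(getpid() == 1 ? 0 : 1);

        int st;
        if (waitpid(child, &st, 0) < 0 || !WIFEXITED(st))
            _exit(2);
        _exit(WEXITSTATUS(st));
    }

    int st;
    if (waitpid(helper, &st, 0) < 0 || !WIFEXITED(st))
        return -1;
    if (WEXITSTATUS(st) == 0)
        return 1;
    return WEXITSTATUS(st) == 1 ? 0 : -1;
}
```

When pid 1 of such a namespace exits, the kernel kills every other
process in it, which is the inescapable behaviour mentioned in the
quoted text. The cost is one user namespace per unshare, which is the
overhead questioned above.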

For my particular scenario I may actually be able to create a single
user namespace, run all processes with (namespaced) CAP_SYS_ADMIN and
use CLONE_NEWPID for every process. However, I would prefer not to
require CAP_SYS_ADMIN, and a regular application that wants to avoid
runaway processes for a spawned helper process cannot rely on

My plan was to use PR_SET_PDEATHSIG_PROC with PR_SET_NO_NEW_PRIVS and a
suitable seccomp filter to prevent changes to pdeath_signal_proc. For
my SIGKILL use case it would be even better to simply require
PR_SET_NO_NEW_PRIVS and make pdeath_signal_proc sticky, avoiding the
need
for seccomp. I wanted to keep the differences to the existing
PR_SET_PDEATHSIG minimal but if we argue that the non-SIGKILL use case
is better solved with waitfd (or maybe the process events connector),
we could tailor the prctl for the SIGKILL use case (or support both via
prctl arg3).
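
For comparison, a sketch of what the existing PR_SET_PDEATHSIG already
provides, combined with PR_SET_NO_NEW_PRIVS as in the plan above.
PR_SET_PDEATHSIG_PROC is the proposed prctl and is not used here; the
existing option only signals the process that set it, not further
descendants. The function name pdeathsig_demo and the pipe handshake
are assumptions of this example.

```c
#include <poll.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: a middle process forks a grandchild that asks for
 * SIGKILL on parent death, then the middle process exits. Returns 1 if
 * the grandchild died (its pipe end closed), 0 on timeout, -1 on setup
 * failure.
 */
int pdeathsig_demo(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t mid = fork();
    if (mid < 0)
        return -1;

    if (mid == 0) {
        pid_t self = getpid();
        pid_t gc = fork();
        if (gc < 0)
            _exit(1);
        if (gc == 0) {
            close(fds[0]);
            /* Re-check the parent afterwards to close the startup race. */
            if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0) < 0 ||
                prctl(PR_SET_PDEATHSIG, SIGKILL, 0, 0, 0) < 0 ||
                getppid() != self)
                _exit(1);
            if (write(fds[1], "r", 1) != 1) /* tell the parent we are set up */
                _exit(1);
            for (;;)
                pause();
        }
        close(fds[1]);
        char c;
        if (read(fds[0], &c, 1) != 1) /* wait for the grandchild's setup */
            _exit(1);
        _exit(0); /* this exit should deliver SIGKILL to the grandchild */
    }

    close(fds[1]);
    int st;
    if (waitpid(mid, &st, 0) < 0 || !WIFEXITED(st) || WEXITSTATUS(st) != 0) {
        close(fds[0]);
        return -1;
    }

    /* The grandchild holds the last write end; hangup means it is gone. */
    struct pollfd p = { .fd = fds[0], .events = POLLIN };
    int n = poll(&p, 1, 5000);
    close(fds[0]);
    if (n < 0)
        return -1;
    return n > 0 ? 1 : 0;
}
```

Nothing in this sketch stops the grandchild from resetting the signal
with another prctl call, which is why a seccomp filter or a sticky
setting is discussed above.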

I have another small patch locally that adds a prctl that restricts
kill(2) to direct children of the current thread group for lightweight
sandboxing. That would also be redundant if it were possible to use
CLONE_NEWPID for every process.

What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN?
Does CLONE_NEWPID pose any risks that don't exist for
CLONE_NEWUSER|CLONE_NEWPID? Assuming we can't simply drop the
CAP_SYS_ADMIN requirement, do you see a better solution for this use