From: Eric W. Biederman
Date: Tue Oct 03 2017 - 13:41:08 EST

Jürg Billeter <j@xxxxxxxxx> writes:

> On Tue, 2017-10-03 at 09:46 -0500, Eric W. Biederman wrote:
>> There is a general need to find out about the death of other processes,
>> if you are not the parent of the process. I would be inclined to call
>> it waitfd. Something that you give a pid. It performs a permission
>> check and the pid becomes readable when the process dies. With poll
>> working on the fd, and the fd returning wstatus of the dead child.
>> Support SIGIO on the fd and you have a signal delivery mechanism,
>> if you want it.
>
> File descriptors for processes (waitfd/clonefd) are definitely
> interesting. Especially if reaping the process (and reparenting its
> children) is delayed until the last process file descriptor is closed.
> However, this would be a much larger addition and also less intuitive
> to use if all you want is killing the process tree.
>
>> For the kill all children when the parent dies the mechanism you are
>> proposing is escapable. We already have an inescapable version of it
>> with init in a pid namespace. We already have an escapable version of
>> it with orphaned process groups and SIGHUP.
>>
>> So I would really appreciate a very clear use case for what we are
>> building here. As it appears the killing of children can already be
>> done another way, and that the waiting for the parent can be done better
>> another way.
>
> My use case is to provide a way for a process to spawn a child and
> ensure that no descendants survive when that child dies. Avoiding
> runaway processes is desirable in many situations. My motivation is
> very lightweight (nested) sandboxing (every process is potentially
> sandboxed).
>
> I.e., pid namespaces would be a pretty good fit (assuming they are
> sufficiently lightweight) but CLONE_NEWPID requires CAP_SYS_ADMIN.
> User namespaces can help here, but creating tons of user namespaces
> just for this doesn't sound sensible. MAX_PID_NS_LEVEL could be an
> issue as well at some point but 32 levels are likely fine in practice.
>
> For my particular scenario I may actually be able to create a single
> user namespace, run all processes with (namespaced) CAP_SYS_ADMIN and
> use CLONE_NEWPID for every process. However, I would prefer not
> requiring CAP_SYS_ADMIN and a regular application that wants to avoid
> runaway processes for a spawned helper process cannot rely on
> My plan was to use PR_SET_PDEATHSIG_PROC with PR_NO_NEW_PRIVS and a
> suitable seccomp filter to prevent changes to pdeath_signal_proc. For
> my SIGKILL use case it would be even better to simply require
> PR_NO_NEW_PRIVS and make pdeath_signal_proc sticky, avoiding the need
> for seccomp. I wanted to keep the differences to the existing
> PR_SET_PDEATHSIG minimal but if we argue that the non-SIGKILL use case
> is better solved with waitfd (or maybe the process events connector),
> we could tailor the prctl for the SIGKILL use case (or support both via
> prctl arg3).
>
> I have another small patch locally that adds a prctl that restricts
> kill(2) to direct children of the current thread group for lightweight
> sandboxing. That would also be redundant if it was possible to use
> CLONE_NEWPID for every process.

I believe the current default limits allow using CLONE_NEWPID for every
process. The data structures seem light enough as well.
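
For illustration, here is a minimal userspace sketch of that approach, not
a definitive implementation. It assumes unprivileged user namespaces are
enabled on the system; the function name pid_seen_in_new_pidns() is made up
for this example. The helper creates a user namespace together with a pid
namespace and checks that the spawned process sees itself as pid 1:

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns 1 if the spawned process saw itself as pid 1 of a new pid
 * namespace, 2 if it saw some other pid, and -1 if the namespaces could
 * not be created (for example, unprivileged user namespaces disabled).
 */
int pid_seen_in_new_pidns(void)
{
	int status;
	pid_t outer = fork();

	if (outer < 0)
		return -1;

	if (outer == 0) {
		pid_t inner;

		/*
		 * CLONE_NEWUSER gives this process CAP_SYS_ADMIN inside the
		 * new user namespace, which satisfies the permission check
		 * for CLONE_NEWPID.  The caller itself does not move; only
		 * its future children land in the new pid namespace.
		 */
		if (unshare(CLONE_NEWUSER | CLONE_NEWPID) != 0)
			_exit(255);

		/* The first child forked after the unshare becomes pid 1. */
		inner = fork();
		if (inner < 0)
			_exit(255);
		if (inner == 0)
			_exit(getpid() == 1 ? 1 : 2);

		if (waitpid(inner, &status, 0) != inner || !WIFEXITED(status))
			_exit(255);
		_exit(WEXITSTATUS(status));
	}

	if (waitpid(outer, &status, 0) != outer || !WIFEXITED(status))
		return -1;
	return WEXITSTATUS(status) == 255 ? -1 : WEXITSTATUS(status);
}
```

In a real sandbox the pid 1 process would exec or supervise the helper.
When that pid 1 exits, the kernel sends SIGKILL to every process left in
the namespace, which is the inescapable behaviour mentioned above.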

> What's actually the reason that CLONE_NEWPID requires CAP_SYS_ADMIN?
> Does CLONE_NEWPID pose any risks that don't exist for
> CLONE_NEWUSER|CLONE_NEWPID? Assuming we can't simply drop the
> CAP_SYS_ADMIN requirement, do you see a better solution for this use
> case?

CLONE_NEWPID without a permission check would allow running a setuid root
application in a pid namespace. Off the top of my head I can't think of
a really good exploit. But when you can mess up pid files and hide
information from a privileged application, I can completely imagine
forcing that application to misbehave in ways the attacker can control,
leading to bad things.