RFC: making cn_proc work in {pid,user} namespaces
From: Aleksa Sarai
Date: Sun Oct 15 2017 - 06:06:08 EST
Hi all,
At the moment, cn_proc is not usable by containers or container
runtimes. In addition, all connectors have an odd relationship with
init_net (for example, /proc/net/connectors only exists in init_net).
There are two main use-cases for which cn_proc would be a perfect fit,
which is the reason for me pushing this:
First, when adding a process to an existing container, in certain modes
runc would like to know that process's exit code. But, when joining a
PID namespace, it is advisable[1] to always double-fork after doing the
setns(2) to reparent the joining process to the init of the container
(this causes the SIGCHLD to be received by the container init). It would
also be useful to be able to monitor the exit code of the init process
in a container without being its parent. At the moment, cn_proc cannot
be used by unprivileged users (which is a problem for user namespaces
and "rootless containers"). It also cannot be used by nested
containers, because it requires the process to be in init_pid_ns. As a
result, runc cannot use cn_proc and has to rely on SIGCHLD, which only
works if we either don't double-fork or keep a long-running process
around (and the latter is something that runc also cannot do).
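To make the double-fork problem concrete, here is a minimal userspace
sketch. It is not runc code: the helper name is made up for this
example, and the setns(2) call is left out so that it runs without
privileges. It only shows that once the intermediate process exits, the
original caller can no longer wait on the grandchild.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: double-fork, then try to wait on the grandchild.
 * Returns the errno of that waitpid() call (ECHILD is expected), 0 if
 * waitpid() unexpectedly succeeded, or -1 if the setup itself failed.
 */
int wait_errno_after_double_fork(void)
{
    int pipefd[2];
    int status;
    pid_t child, grandchild = -1;

    if (pipe(pipefd) < 0)
        return -1;

    child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        /* Intermediate process; a runtime would have done setns(2) here. */
        close(pipefd[0]);
        pid_t gc = fork();
        if (gc == 0) {
            close(pipefd[1]);
            _exit(42); /* the exit code the caller would like to see */
        }
        /* Report the grandchild's pid, then exit so it gets reparented. */
        if (gc < 0 || write(pipefd[1], &gc, sizeof(gc)) != (ssize_t)sizeof(gc))
            _exit(1);
        _exit(0);
    }

    close(pipefd[1]);
    ssize_t n = read(pipefd[0], &grandchild, sizeof(grandchild));
    close(pipefd[0]);

    /* Reap the intermediate process; this part still works. */
    if (waitpid(child, &status, 0) < 0 || n != (ssize_t)sizeof(grandchild))
        return -1;

    /* The grandchild was never our child, so its status is out of reach. */
    if (waitpid(grandchild, &status, 0) < 0)
        return errno;
    return 0;
}
```

The caller knows the grandchild's pid but has no way to collect the 42,
unless it happens to be a child subreaper or the init of its own pid
namespace. This is the gap that cn_proc exit events would fill.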
Secondly, there are/were some init systems that rely on cn_proc to
manage service state. From an "it would be neat" perspective, I think
it would be quite nice if such init systems could be used inside
containers. But that requires cn_proc to be usable by an unprivileged
user and in a pid namespace other than init_pid_ns.
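For reference, this is roughly what subscribing to cn_proc looks like
from userspace today. It is a sketch using the existing uapi headers
(linux/connector.h, linux/cn_proc.h); the two function names are made
up for this example and error handling is kept to a minimum.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/connector.h>
#include <linux/cn_proc.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: fill buf (which must be suitably aligned for a
 * struct nlmsghdr) with a PROC_CN_MCAST_LISTEN request.
 * Returns the message length, or 0 if buf is too small.
 */
size_t build_proc_listen(void *buf, size_t buflen, __u32 port_id)
{
    enum proc_cn_mcast_op op = PROC_CN_MCAST_LISTEN;
    size_t payload = sizeof(struct cn_msg) + sizeof(op);
    struct nlmsghdr *nlh = buf;
    struct cn_msg *cn;

    if (buflen < NLMSG_SPACE(payload))
        return 0;
    memset(buf, 0, NLMSG_SPACE(payload));

    nlh->nlmsg_len = NLMSG_LENGTH(payload);
    nlh->nlmsg_type = NLMSG_DONE;
    nlh->nlmsg_pid = port_id;

    /* The connector header selects the proc connector. */
    cn = NLMSG_DATA(nlh);
    cn->id.idx = CN_IDX_PROC;
    cn->id.val = CN_VAL_PROC;
    cn->len = sizeof(op);
    memcpy(cn->data, &op, sizeof(op));

    return nlh->nlmsg_len;
}

/*
 * Hypothetical helper: open a connector socket and ask for proc events.
 * Returns the socket fd, or -1 on failure (errno is left set).
 */
int proc_connector_subscribe(void)
{
    union { struct nlmsghdr hdr; char bytes[64]; } buf;
    struct sockaddr_nl addr = {
        .nl_family = AF_NETLINK,
        .nl_pid = getpid(),
        .nl_groups = CN_IDX_PROC,
    };
    size_t len;
    int fd;

    fd = socket(PF_NETLINK, SOCK_DGRAM, NETLINK_CONNECTOR);
    if (fd < 0)
        return -1;

    /* Joining the multicast group is where an unprivileged caller is refused. */
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto fail;

    len = build_proc_listen(&buf, sizeof(buf), addr.nl_pid);
    if (len == 0 || send(fd, &buf, len, 0) < 0)
        goto fail;
    return fd;

fail:
    close(fd);
    return -1;
}
```

An init system would then recv() on the returned fd and decode each
struct proc_event. Inside a user or pid namespace, I would expect
proc_connector_subscribe() to fail, or to see nothing useful, which is
the limitation described above.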
The /proc/net/connectors thing is quite easily resolved (just make the
connector driver per-netns and make some small changes to make sure the
interfaces stay sane inside of a container's network namespace). We'll
probably also have to make some changes to the registration API, so
that a connector can specify whether it wants to be visible to
non-init_net namespaces.
However, the cn_proc problem is a bit harder to resolve nicely and there
are quite a few interface questions that would need to be agreed upon.
The basic idea would be that a process can only get cn_proc events
about a given process if it has ptrace_may_access rights over that
process (effectively a forced filter -- which would ideally be done
send-side but it looks like it might have to be done receive-side).
This should resolve possible concerns about an unprivileged process
being able to inspect (fairly granular) information about the host.
And obviously the pids, uids, and gids would all be translated
according to the receiving process's pid and user namespaces (if an ID
cannot be translated then the message is not received). I guess that
the translation would be done in the same way as SCM_CREDENTIALS (and
cgroup.procs files), which is that it's done on the receive side, not
the send side.
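As a rough illustration of the "translate or drop" rule, here is a
userspace model of the uid/gid part. It is not kernel code: the struct
and function names are invented for this example, and the extent table
is only loosely modelled on what a uid_map describes and on how
from_kuid() reports an unmapped ID. Pids would need the pid-namespace
equivalent, which is not modelled here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ID_UNMAPPED UINT32_MAX /* stand-in for "no mapping in this namespace" */

/* One line of a uid_map/gid_map: count IDs starting at host_first map to ns_first. */
struct id_extent {
    uint32_t ns_first;
    uint32_t host_first;
    uint32_t count;
};

/* Hypothetical event layout; the real struct proc_event carries more fields. */
struct exit_event {
    uint32_t uid;
    uint32_t gid;
    uint32_t exit_code;
};

/* Translate a host ID into the receiver's namespace, or return ID_UNMAPPED. */
uint32_t map_id_to_ns(const struct id_extent *map, size_t n, uint32_t host_id)
{
    for (size_t i = 0; i < n; i++) {
        if (host_id >= map[i].host_first &&
            host_id - map[i].host_first < map[i].count)
            return map[i].ns_first + (host_id - map[i].host_first);
    }
    return ID_UNMAPPED;
}

/*
 * Receive-side filter: returns true and fills *out if every ID in the
 * event can be expressed in the receiver's namespace, false if the
 * event has to be dropped.
 */
bool translate_event(const struct id_extent *uid_map, size_t n_uid,
                     const struct id_extent *gid_map, size_t n_gid,
                     const struct exit_event *host, struct exit_event *out)
{
    uint32_t uid = map_id_to_ns(uid_map, n_uid, host->uid);
    uint32_t gid = map_id_to_ns(gid_map, n_gid, host->gid);

    if (uid == ID_UNMAPPED || gid == ID_UNMAPPED)
        return false;

    *out = *host;
    out->uid = uid;
    out->gid = gid;
    return true;
}
```

With a single extent mapping host IDs 100000..165535 to 0..65535, an
event for host uid 100000 would be delivered as uid 0, while an event
for host uid 0 would never reach the receiver.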
My reason for sending this email rather than just writing the patch is
to see whether anyone has any solid NACKs against the use-case or
whether there is some fundamental issue that I'm not seeing. If nobody
objects, I'll be happy to work on this.
[1]: https://lwn.net/Articles/532748/
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/