Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects

From: Christian Brauner
Date: Wed Feb 20 2019 - 08:26:13 EST

On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > Implement a kernel container object such that it contains the following
> > things:
> >
> > (1) Namespaces.
> >
> > (2) A root directory.
> >
> > (3) A set of processes, including one designated as the 'init' process.
> Yeah, I think a name other than init needs to be used for this
> process.
> The problem being that there is no requirement for container
> process 1 to behave in any way like an "init" process is
> expected to behave and that leads to confusion (at least
> it certainly did for me).

If you look at pid_namespaces(7) you can see that pid 1 inside a pid
namespace is expected to behave like an init:
- "The first process created in a new namespace [...] has the PID 1,
and is the "init" process for the namespace (see init(1))."
- "[...] child process that is orphaned within the namespace will be
reparented to this process rather than init(1) [...]"
- "If the "init" process of a PID namespace terminates, the kernel
terminates all of the processes in the namespace via a SIGKILL
signal. This behavior reflects the fact that the "init" process is
essential for the correct operation of a PID namespace."
- "Only signals for which the "init" process has established a signal
handler can be sent to the "init" process by other members of the
PID namespace."
- "[...] the reboot(2) system call causes a signal to be sent to the
namespace "init" process."

This is one of the reasons why all major container runtimes, after
years of failing to realize this, now run a stub init process that
mimics a dumb init. Sure, you can get away with a pid 1 that does not
behave like an init, but this is inherently broken, or at least goes
against the way pid namespaces were designed.
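
To make "stub init" a bit more concrete, here is a rough sketch of what
such a process has to do: start the real payload, forward a few signals
to it, and reap whatever children it is handed, including orphans that
get reparented to it. This is only an illustration and not taken from
any particular runtime; the names run_stub_init() and
exit_code_from_status() are made up for this example, and the choice of
forwarded signals is an assumption.

```python
import os
import signal


def exit_code_from_status(status):
    """Translate a wait() status into a shell-style exit code."""
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    if os.WIFSIGNALED(status):
        return 128 + os.WTERMSIG(status)
    return 1


def run_stub_init(argv):
    """Run argv as the payload and act as a dumb init until it exits.

    Hypothetical helper: forwards a few signals to the payload and reaps
    every child that terminates, returning the payload's exit code.
    """
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed

    def forward(signum, _frame):
        # Pass the signal on to the payload; ignore it if already gone.
        try:
            os.kill(child, signum)
        except ProcessLookupError:
            pass

    # Assumption: these three are enough for the illustration.
    forwarded = (signal.SIGTERM, signal.SIGINT, signal.SIGHUP)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        while True:
            # Reaps any child, including orphans reparented to us when
            # we really are pid 1 of a pid namespace.
            pid, status = os.wait()
            if pid == child:
                return exit_code_from_status(status)
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Calling run_stub_init(["sh", "-c", "exit 3"]) returns the payload's exit
code. Once pid 1 exits, the kernel kills the rest of the namespace, so
the sketch stops as soon as the payload itself is gone.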