Re: [RFC PATCH 02/27] containers: Implement containers as kernel objects

From: Ian Kent
Date: Thu Feb 21 2019 - 05:39:43 EST

On Wed, 2019-02-20 at 14:26 +0100, Christian Brauner wrote:
> On Wed, Feb 20, 2019 at 10:46:24AM +0800, Ian Kent wrote:
> > On Fri, 2019-02-15 at 16:07 +0000, David Howells wrote:
> > > Implement a kernel container object such that it contains the following
> > > things:
> > >
> > > (1) Namespaces.
> > >
> > > (2) A root directory.
> > >
> > > (3) A set of processes, including one designated as the 'init' process.
> >
> > Yeah, I think a name other than init needs to be used for this
> > process.
> >
> > The problem being that there is no requirement for container
> > process 1 to behave in any way like an "init" process is
> > expected to behave and that leads to confusion (at least
> > it certainly did for me).
>
> If you look at the documentation for pid_namespaces(7) you can see that
> the pid 1 inside a pid namespace is expected to behave like an init
> process:
> - "The first process created in a new namespace [...] has the PID 1,
> and is the "init" process for the namespace (see init(1))."
> - "[...] child process that is orphaned within the namespace will be
> reparented to this process rather than init(1) [...]"
> - "If the "init" process of a PID namespace terminates, the kernel
> terminates all of the processes in the namespace via a SIGKILL
> signal. This behavior reflects the fact that the "init" process is
> essential for the correct operation of a PID namespace."
> - "Only signals for which the "init" process has established a signal
> handler can be sent to the "init" process by other members of the
> PID namespace."
> - "[...] the reboot(2) system call causes a signal to be sent to the
> namespace "init" process."
> This is one of the reasons why all major current container runtimes
> finally after years of failing to realize this run a stub init process
> that mimics a dumb init. Sure, you get away with not having an init
> that behaves like an init but this is inherently broken or at least
> against the way pid namespaces were designed.

TBH I wasn't sure why the signal I sent didn't arrive, AFAICS
it should have regardless of what signals the container init
process was accepting. But it could have been due to a
different problem in my kernel code (that's very likely).

In any case it wasn't worth pursuing because even if I did work
it out I had already found that the request_key sub-system wasn't
playing well with others when trying to run something within a
container's namespaces, so no point in going further ...
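
As an aside, the "stub init" behaviour described in the quoted text
above can be sketched roughly as below. This is only an illustration,
not how any particular runtime does it: the names install_handlers,
reap_children and run_stub_init are made up for this example, and the
code runs as an ordinary process rather than creating a PID namespace.
It shows the two duties pid_namespaces(7) implies: establish handlers
for the signals you want to be deliverable, and reap children.

```python
import os
import signal
import time


def install_handlers(signums=(signal.SIGTERM, signal.SIGINT)):
    """Establish handlers so these signals are not silently dropped.

    Per pid_namespaces(7), other members of the namespace can only send
    the namespace init signals for which it has established a handler.
    Returns the list that the handler appends received signals to.
    """
    received = []

    def handler(signum, frame):
        received.append(signum)

    for signum in signums:
        signal.signal(signum, handler)
    return received


def reap_children():
    """Reap every exited child without blocking; return {pid: exit code}."""
    reaped = {}
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        if os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
        elif os.WIFSIGNALED(status):
            # Shell-style convention, chosen here for illustration only.
            reaped[pid] = 128 + os.WTERMSIG(status)
    return reaped


def run_stub_init(argv):
    """Run argv as the workload; forward signals and reap until it exits."""
    received = install_handlers()
    child = os.fork()
    if child == 0:
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)  # only reached if exec failed
    while True:
        while received:
            os.kill(child, received.pop(0))  # pass the signal on
        reaped = reap_children()
        if child in reaped:
            return reaped[child]
        time.sleep(0.05)
```

Something like run_stub_init(["sh", "-c", "exit 3"]) should return 3.
Outside a PID namespace this only ever reaps its own children; inside
one, running as PID 1, the same loop would also collect the orphans
that get reparented to it.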