Re: [RFC PATCH 0/4] namespacefs: Proof-of-Concept

From: James Bottomley
Date: Mon Nov 22 2021 - 08:44:14 EST

On Mon, 2021-11-22 at 15:02 +0200, Yordan Karadzhov wrote:
> On 20.11.21 г. 1:08 ч., James Bottomley wrote:
> > [trying to reconstruct cc list, since the cc: field is bust again]
> > > On Fri, 19 Nov 2021 11:47:36 -0500
> > > Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> > >
> > > > > Can we back up and ask what problem you're trying to solve
> > > > > before we start introducing new objects like namespace name?
> > >
> > > TL;DR version:
> > >
> > > We want to be able to install a container on a machine that will
> > > let us view all namespaces currently defined on that machine and
> > > which tasks are associated with them.
> > >
> > > That's basically it.
> >
> > So you mentioned kubernetes. Have you tried
> >
> > kubectl get pods --all-namespaces
> >
> > ?
> >
> > The point is that orchestration systems usually have interfaces to
> > get this information, even if the kernel doesn't. In fact,
> > userspace is almost certainly the best place to construct this
> > from.
> >
> > To look at this another way, what if you were simply proposing the
> > exact same thing but for the process tree. The push back would be
> > that we can get that all in userspace and there's even a nice tool
> > (pstree) to do it which simply walks the /proc interface. Why,
> > then, do we have to do nstree in the kernel when we can get all the
> > information in exactly the same way (walking the process tree)?
> >
> I see one important difference between the problem we have and the
> problem in your example. /proc contains all the
> information needed to unambiguously reconstruct the process tree.
> On the other hand, I do not see how one can reconstruct the namespace
> tree using only the information in /proc (maybe this is because of my
> ignorance).

Well, no, the information may not all exist. However, the point is we
can add it without adding additional namespace objects.
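
For reference, here is a minimal sketch of what a userspace walk of /proc can already recover today: which processes share a given namespace. It only reads the /proc/<pid>/ns/<type> symlinks, so it says nothing about parent or owning namespaces. The function name and the proc_root parameter are made up for illustration.

```python
import os
from collections import defaultdict


def namespaces_by_id(proc_root="/proc", ns_type="pid"):
    """Map each namespace identifier (e.g. 'pid:[4026531836]') to its PIDs.

    Hypothetical helper: it only groups processes by the target of
    <proc_root>/<pid>/ns/<ns_type>; it cannot tell which namespace is the
    parent of which.
    """
    groups = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'sys'
        link = os.path.join(proc_root, entry, "ns", ns_type)
        try:
            ns_id = os.readlink(link)
        except OSError:
            # The process exited, or we lack permission to inspect it.
            continue
        groups[ns_id].append(int(entry))
    return {ns_id: sorted(pids) for ns_id, pids in groups.items()}
```

Calling namespaces_by_id() with the defaults on a live system returns one entry per pid namespace that has at least one visible process in it; passing ns_type="net" or "mnt" does the same for the other namespace types.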

> Let's look at the following case (oversimplified just to get the idea):
> 1. The process X is a parent of the process Y and both are in
> namespace 'A'.
> 3. "unshare" is used to place process Y (and all its child processes)
> in a new namespace B (A is a parent namespace of B).
> 4. "setns" is used to move process X into namespace C.
> How would you find the parent namespace of B?

Actually this one's quite easy: the parent of X in your setup still has

However, I think you're looking to set up a scenario where the
namespace information isn't carried by live processes and that's
certainly possible if we unshare the namespace, bind it to a mount
point and exit the process that unshared it. It will exist as a bound
namespace with no processes until it gets entered via the binding and
when that happens the parent information can't be deduced from the
process tree.
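
To illustrate the bound-namespace case, here is a rough sketch that lists namespaces pinned by a bind mount, which a walk over /proc/<pid>/ns would miss when no process is inside them. It assumes such mounts show up in /proc/<pid>/mountinfo with filesystem type "nsfs" and the namespace identifier in the root field; the function name is invented for the example. As with the process walk, it finds the namespace but not its parent.

```python
def bound_namespaces(mountinfo_text):
    """Return (namespace_id, mount_point) pairs for nsfs bind mounts.

    Hypothetical helper: expects the text of a mountinfo file, where each
    line looks like
      <id> <parent> <maj:min> <root> <mount point> <opts> ... - <fstype> ...
    """
    found = []
    for line in mountinfo_text.splitlines():
        # A lone "-" separates the variable-length fields from the fstype.
        left, sep, right = line.partition(" - ")
        if not sep:
            continue
        pre = left.split()
        post = right.split()
        if len(pre) < 5 or not post or post[0] != "nsfs":
            continue
        # pre[3] is the root within the filesystem, e.g. 'net:[4026532599]';
        # pre[4] is where the namespace is bound.
        found.append((pre[3], pre[4]))
    return found
```

Feeding it the contents of /proc/self/mountinfo covers the bindings visible in the caller's mount namespace only.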

There's another problem, that I think you don't care about but someone
will at some point: the owning user_ns can't be deduced from the
current tree either because it depends on the order of entry. We fixed
unshare so that if you enter multiple namespaces, it enters the user_ns
first so the latter is always the owning namespace, but if you enter
the rest of the namespaces first via one unshare then unshare the
user_ns second, that won't be true.

Neither of the above actually matters for docker-like containers because
that's not the way the orchestration system works (it doesn't use mount
bindings or the user_ns) but one day, hopefully, it might.

> Again, using your arguments, I can reformulate the problem statement
> this way: a userspace program is well instrumented
> to create an arbitrarily complex tree of namespaces. At the same time,
> the only place where the information about the
> created structure can be retrieved is in the userspace program
> itself. And when we have multiple userspace programs
> adding to the namespaces tree, the global picture gets impossible to
> recover.

So figure out what's missing in the /proc tree and propose adding it.
The interface isn't immutable; it's just that what exists today is an
ABI and can't be altered. I think this is the last time we realised we
needed to add missing information in /proc/<pid>/ns:

So you can use that as the pattern.