Re: [RFC][PATCH 0/9] Make containers kernel objects

From: James Bottomley
Date: Tue May 23 2017 - 11:02:41 EST


On Tue, 2017-05-23 at 14:52 +0100, David Howells wrote:
> James Bottomley <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
>
> > This sounds like a step in the wrong direction: the strength of the
> > current container interfaces in Linux is that people who set up
> > containers don't have to agree what they look like.
>
> It may be a strength, but it is also a problem.
>
> > So I can set up a user namespace without a mount namespace or an
> > architecture emulation container with only a mount namespace.
>
> (I presume you mean with only the mount namespace separate)
>
> Yep. You can do that with this too.
>
> > But ignoring my fun foibles with containers and to give a concrete
> > example in terms of a popular orchestration system: in kubernetes,
> > where certain namespaces are shared across pods, do you imagine the
> > kernel's view of the "container" to be the pod or what kubernetes
> > thinks of as the container?
>
> Why not both? If the net_ns is created in the pod container, then
> probably network-related upcalls should be directed there. Unless
> instructed otherwise, upon creation a container object will inherit
> the caller's namespaces.

The pod isn't a container, it's a collection of containers. Let's say
each container has a separate mount namespace but shares a network
namespace (this is a gross simplification, there are many other ways
you can set up a pod, but this one illustrates the point). For your
upcall you'd have to pick a kubernetes container and you don't have the
information to do that, even with your current patches, because of what
kubernetes has done. This is where your view of "container" doesn't
match the kubernetes view.
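
To make the ambiguity concrete, here is a toy model of the simplified pod
described above. It is only an illustration, not kernel code and not
anything from the patch set: the Container class, the upcall_candidates
helper and the namespace numbers are all made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Container:
    """Hypothetical stand-in for what kubernetes calls a container."""
    name: str
    mnt_ns: int  # made-up mount namespace identifier
    net_ns: int  # made-up network namespace identifier


def upcall_candidates(containers, net_ns):
    """Return every container whose network namespace matches net_ns."""
    return [c for c in containers if c.net_ns == net_ns]


# One pod: separate mount namespaces, one shared network namespace.
pod = [
    Container("app", mnt_ns=101, net_ns=7),
    Container("sidecar", mnt_ns=102, net_ns=7),
]

# A network-related upcall only knows the network namespace it came from.
candidates = upcall_candidates(pod, net_ns=7)
print(len(candidates))  # 2
```

Both containers match, and each has a different mount namespace, so the
network namespace alone does not say which container's filesystem view
the upcall should run in.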

> > This is important, because half the examples you give below are
> > network related and usually pods share a network namespace.
>
> Yeah - I'm more familiar with upcalls made by NFS, AFS and keyrings.

OK, so rather than getting into the technical back and forth below can
we agree that the kernel can't have a unitary view of "container"
because the current use cases (the orchestration systems) don't have
one? Then the next step becomes how can we add an abstraction that
gives you what you want (as far as I can tell basically identifying a
set of namespaces for an upcall) in a way that doesn't bind the kernel
to have a unitary view of a container? And then we can tack the ideas
on to the Jeff/Eric subthread.

James