Re: [PATCH v2] cgroup: allow management of subtrees by new cgroup namespaces

From: James Bottomley
Date: Tue May 03 2016 - 10:26:38 EST

On Tue, 2016-05-03 at 16:48 +1000, Aleksa Sarai wrote:
> > > > Perhaps what you should to be arguing then that the default
> > > > permissions of the cgroup directories need to be all rwx for
> > > > everyone and then your patch becomes unnecessary?
> > >
> > > I don't think that would be the nicest way of dealing with this
> > > (then a process can make very large numbers of cgroups all over
> > > the tree, which might not cause huge issues but would still be a
> > > pain for administrators and systemds alike).
> >
> > Beware of what you cite as a problem. Any user can enter a user
> > namespace and then unshare a cgroup namespace. This means that
> > what you seem to want is equivalent to any user at all being able
> > to create a cgroup hierarchy.
> They should only be allowed to make subtrees of the cgroup *they
> currently reside in* IMO.

For the usual case that is the top level cgroup because most processes
don't get initially confined. If there is initial confinement by
something, then whatever it is could alter the permissions as well.

So if the default case is equivalent to making all the initial top
level cgroups rwx, we should understand the implications of that and
the best way to concentrate minds is to ask what happens if it were the

> Making the hierarchies chmod(0777) would allow any process in any
> hierarchy to create cgroups anywhere in the tree. It would just make
> management within cgroupv2 much harder (especially with the no
> internal process semantics of cgroupv2). This wouldn't happen with
> the cgroup namespace.
> It should be noted that cgroupv1 doesn't have the same protection I
> outline below, so this would actually cause cgroup escapes in that
> case (which we should obviously avoid).

> > [...] This means that either it is a problem, and the
> > cgroup namespace will have to be restricted in some way over how it
> > can create subordinate cgroups or it's not a problem and we might
> > as well just see what happens if any old user can do it.
> See above. The only restriction I think is necessary is that the
> process can only create subtrees of the current cgroup it is in.

I don't think there's a disagreement about that. The problem is that
the current cgroup is usually the top level one because most systems
don't place processes into any other cgroup in the default case.

> > > > Alternatively, if the desire is fully to virtualize
> > > > /sys/fs/cgroups, then I think we have to decide how that would
> > > > happen. I think the default requirements would be that a pid
> > > > namespace be established (so only the tasks in that pid
> > > > namespace would be able to be controlled by the cgroup
> > > > namespace. That, I think requires that any given cgroup
> > > > namespace "own" a pid namespace (being the one present when it
> > > > was created) but that it only gets a new virtual set of
> > > > directories owned by the userns owner if there's a pid
> > > > namespace established for the cgroup and cgroup->user_ns ==
> > > > pid_ns->user_ns (meaning we established a user ns then a pid
> > > > one then a cgroup one, so it's now safe to treat root in the
> > > > user_ns as owning the virtualized cgroup directories).
> > >
> > > I know this is probably a stupid question, but why couldn't we
> > > just compare the user_ns with the tcred->user_ns?
> >
> > If any old user namespace can unshare a cgroup namespace and
> > manipulate the tree, then that condition is just fine. If we're
> > going to require they have to create a pid namespace as well, then
> > you need a more elaborate condition.

> Well, I guess my question was more like "if we don't require the pid
> namespace pinning, what bad things will happen?". You've described
> the corner case, and I'm not sure it's a problem for cgroupv2. It is
> a problem for cgroupv1 (unfortunately), due to backwards
> compatibility reasons. So, we have to decide whether we are going to
> add a restriction for cgroup namespaces, so that this functionality
> can be implemented for both versions -- or should we only implement
> the minimal version (which would only work on cgroupv2).
> If we decide to implement both, we have to agree on the restrictions
> *immediately* because the cgroup namespace was merged in 4.6-rc1 so
> changing the restrictions on it in 4.7 would probably be frowned
> upon.

No, that horse has left the stable: the cgroup namespace applies to
both v1 and v2.

> > > Or are you worried about a process in a cgroup namespace moving
> > > processes to a subtree that isn't in the same pid namespace (even
> > > though they're in the same user namespace)?
> >
> > The corner case I'm worrying about is what happens to a process
> > owned by the user that gets moved by the administrator to a more
> > confining cgroup after the establishment of the cgroup namespace?
> > If we allow too much capability to the user_ns->owner, then they
> > could just take it out again. The semantics of who can do what
> > after the namespace is established seem to need better definition.
> > One answer might be that after the cgroup namespace is established
> > , the real admin can't safely move the processes, which is why they
> > should be better confined (say within a pid namespace) so it's not
> > *all* processes owned by this user that can escape control, merely
> > ones that the user has declared a desire to control the cgroups
> > for).
> My thinking was that rename(2) would make this a simple decision, but
> I just realised that rename(2) doesn't let you change the hierarchy.
> But it should be noted that cgroupv2 has a fix for this: you can't
> move a task to another cgroup unless you have attach rights
> (cgroup.procs) to the common ancestor of the current cgroup and the
> target cgroup.

Currently the decision is made in cgroup_procs_write_permission() and
actually is blind to the user namespace, so this needs updating anyway.


> What this means is that you can only "fight the administrator" in the
> case that the admin has decided to move you inside the subtree of the
> cgroup namespace you are in. Itherwise, you can't move back to your
> old cgroup. Of course, it means that the cgroup namespace is now
> broken -- but that's what the admin wanted to do. I don't think that
> should be a problem.
> Since this isn't available in cgroupv1, there's a question about
> whether this functionality should be allowed for cgroupv1. Since
> cgroupv2 still doesn't support all of the controllers needed for many
> container runtimes to work, I don't think we should not implement
> this for cgroupv1. But that's just my opinion.