Re: [PATCH v3 2/2] cgroup: allow management of subtrees by new cgroup namespaces

From: Aleksa Sarai
Date: Mon May 02 2016 - 21:52:35 EST


> > Change the mode of the cgroup directory for each cgroup association,
> > allowing the process to create subtrees and modify the limits of the
> > subtrees *without* allowing the process to modify its own limits. Due to
> > the cgroup core restrictions and unix permission model, this allows for
> > processes to create new subtrees without breaking the cgroup limits for
> > the process.
>
> I don't get why this is necessary. What's wrong with the parent
> setting up permission correctly for the namespace?

The parent setting this up requires either:

1. A privileged process giving the process write access to the cgroup directory it is currently in. No software does this by default, and it does not always make sense (systemd doesn't like processes messing around in their respective cgroups), so this needs a better solution.

2. The process itself being privileged, which is not the use case I'm going for with rootless containers. If you have root, you can do whatever you want in this regard and this feature doesn't affect you.
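
To make option 1 concrete, the delegation a privileged parent would have to perform can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the patchset: the helper name delegate_cgroup and the choice of which files to hand over (just cgroup.procs here) are hypothetical. The limit files are deliberately left untouched, so the delegatee could create subtrees without being able to change its own limits.

```python
import os

# Hypothetical helper, not part of the patchset. Which control files to hand
# over is an assumption; limit files are deliberately NOT included.
DELEGATED_FILES = ("cgroup.procs",)


def delegate_cgroup(cgroup_dir, uid, gid, files=DELEGATED_FILES):
    """Give uid:gid ownership of a cgroup directory and selected files.

    Owning the directory lets the delegatee mkdir new child cgroups.
    Returns the list of paths whose ownership was changed.
    """
    changed = []

    # Ownership of the directory itself is what permits creating subtrees.
    os.chown(cgroup_dir, uid, gid)
    changed.append(cgroup_dir)

    # Hand over only the listed files, skipping any that do not exist.
    for name in files:
        path = os.path.join(cgroup_dir, name)
        if os.path.exists(path):
            os.chown(path, uid, gid)
            changed.append(path)

    return changed
```

On a real system this would have to run as root against the directory the target process is currently in (one of the paths listed in /proc/<pid>/cgroup, relative to the hierarchy's mount point), which is exactly the privileged step an unprivileged process cannot rely on.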

The main reason for this patchset is that I would like unprivileged processes to be able to take advantage of cgroup features (such as the freezer cgroup, or just regular resource limiting). Since cgroups are a hierarchy, I can see no fundamental reason why this should not be possible, and the cgroup namespace appears to be the perfect way of doing it. I firmly believe there is a simple and safe way of allowing unprivileged processes to create subtrees of their current cgroup.

However, I agree with James that this patchset isn't ideal (it was my first rough attempt). I think I'll get to work on properly virtualising /sys/fs/cgroup, which will allow a new cgroup namespace to modify subtrees (but without allowing cgroup escape) -- by pinning which pid namespace each cgroup was created under. We can use the same type of virtualisation that /proc does (except that instead of selectively showing the dentries, we selectively show different owners of the dentries).

Would that be acceptable?

--
Aleksa Sarai
Software Engineer (Containers)
SUSE Linux GmbH
https://www.cyphar.com/