[PATCH 7/8] cgroup: Add documentation for cgroup namespaces

From: serge . hallyn
Date: Tue Dec 22 2015 - 23:24:55 EST


From: Aditya Kali <adityakali@xxxxxxxxxx>

Signed-off-by: Aditya Kali <adityakali@xxxxxxxxxx>
Signed-off-by: Serge Hallyn <serge.hallyn@xxxxxxxxxxxxx>
---
Changelog (2015-12-08):
Merge into Documentation/cgroup.txt
Changelog (2015-12-22):
Reformat to try to follow the style of the rest of the cgroup.txt file.

Signed-off-by: Serge Hallyn <serge.hallyn@xxxxxxxxxx>
---
Documentation/cgroup.txt | 150 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 150 insertions(+)
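
Illustration only, not part of the diff below: items (3) and (4) of the new
section describe how the path shown in "/proc/$PID/cgroup" depends on the
reader's cgroupns root. The following is a minimal userspace sketch of that
rule, assuming the reported path is simply the task's cgroup path expressed
relative to the reader's cgroupns root and prefixed with '/'. The function
name cgroup_path_in_namespace() is hypothetical; the kernel computes this
internally and exposes no such helper.

```python
import posixpath


def cgroup_path_in_namespace(actual_path, cgroupns_root):
    """Model the cgroup path a reader rooted at cgroupns_root would see.

    Both arguments are absolute cgroup paths as seen from the initial
    cgroup namespace. This is an illustrative model, not kernel code.
    """
    rel = posixpath.relpath(actual_path, cgroupns_root)
    if rel == ".":
        # The task sits exactly at the cgroupns root.
        return "/"
    # The relative path is always reported with a leading '/'.
    return "/" + rel


root = "/batchjobs/container_id1"
# Task at the cgroupns root.
print(cgroup_path_in_namespace("/batchjobs/container_id1", root))
# Task in a child cgroup of the cgroupns root, as in item (3)(a).
print(cgroup_path_in_namespace("/batchjobs/container_id1/sub_cgrp_1", root))
# Task moved outside the cgroupns root, as in item (4).
print(cgroup_path_in_namespace("/batchjobs/container_id2", root))
```

Under these assumptions the three calls print "/", "/sub_cgrp_1" and
"/../container_id2", matching the example output in the documentation.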

diff --git a/Documentation/cgroup.txt b/Documentation/cgroup.txt
index 31d1f7b..03ad757 100644
--- a/Documentation/cgroup.txt
+++ b/Documentation/cgroup.txt
@@ -47,6 +47,7 @@ CONTENTS
5-3. IO
5-3-1. IO Interface Files
5-3-2. Writeback
+6. Cgroup Namespaces
P. Information on Kernel Programming
P-1. Filesystem Support for Writeback
D. Deprecated v1 Core Features
@@ -1013,6 +1014,155 @@ writeback as follows.
vm.dirty[_background]_ratio.


+6. Cgroup Namespaces
+
+Cgroup namespaces provide a mechanism to virtualize the view of the
+"/proc/$PID/cgroup" file. The CLONE_NEWCGROUP flag can be passed to the
+clone() and unshare() syscalls to create a new cgroup namespace. A process
+running inside the cgroup namespace has its "/proc/$PID/cgroup" output
+restricted to the cgroupns root, which is the cgroup of the process at
+the time the cgroup namespace was created.
+
+Prior to cgroup namespaces, the "/proc/$PID/cgroup" file showed the complete
+path of the cgroup of a process. In a container setup where a set of cgroups
+and namespaces are intended to isolate processes, the "/proc/$PID/cgroup" file
+may leak system-level information to the isolated processes.
+
+For example:
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+The path '/batchjobs/container_id1' is generally considered system data,
+and it is desirable not to expose it to the isolated process.
+
+Cgroup namespaces can be used to restrict visibility of this path.
+For example, before creating a cgroup namespace, one would see:
+
+ # ls -l /proc/self/ns/cgroup
+ lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+After unsharing into a new cgroup namespace, the view changes:
+
+ # ls -l /proc/self/ns/cgroup
+ lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+
+A task in the global cgroup namespace, however, still sees the full path:
+
+ # cat /proc/$PID/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1
+
+If the user and mount namespaces are unshared as well, then the root of a
+cgroupfs mounted from inside them will be the task's cgroup.
+
+ # lxc-usernsexec --unshare -m -c
+ # mount -t cgroup cgroup /tmp/cgroup
+ # ls -l /tmp/cgroup
+ total 0
+ -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.controllers
+ -r--r--r-- 1 root root 0 2014-10-13 09:32 cgroup.populated
+ -rw-r--r-- 1 root root 0 2014-10-13 09:25 cgroup.procs
+ -rw-r--r-- 1 root root 0 2014-10-13 09:32 cgroup.subtree_control
+
+The cgroupns root (/batchjobs/container_id1 in the above example) becomes the
+filesystem root for the namespace-specific cgroupfs mount.
+
+Virtualizing the /proc/self/cgroup file, combined with restricting
+the view of the cgroup hierarchy through a namespace-private cgroupfs mount,
+should provide a completely isolated cgroup view inside the container.
+
+In its current form, the cgroup namespaces patchset provides the following
+behavior:
+
+(1) The 'cgroupns root' for a cgroup namespace is the cgroup in which
+ the process calling unshare is running.
+ For example, if a process in the /batchjobs/container_id1 cgroup calls
+ unshare, cgroup /batchjobs/container_id1 becomes the cgroupns root.
+ For the init_cgroup_ns, this is the real root ('/') cgroup
+ (identified in code as cgrp_dfl_root.cgrp).
+
+(2) The cgroupns root cgroup does not change even if the namespace
+ creator process later moves to a different cgroup.
+ # ~/unshare -c # unshare cgroupns in some cgroup
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/
+ # mkdir sub_cgrp_1
+ # echo 0 > sub_cgrp_1/cgroup.procs
+ # cat /proc/self/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(3) Each process gets its namespace-specific view of "/proc/$PID/cgroup".
+
+(a) Processes running inside the cgroup namespace will be able to see
+ cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
+ From within an unshared cgroupns:
+ # sleep 100000 &
+ [1] 7353
+ # echo 7353 > sub_cgrp_1/cgroup.procs
+ # cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+
+(b) From the initial cgroup namespace, the real cgroup path will be visible:
+ $ cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/batchjobs/container_id1/sub_cgrp_1
+
+(c) From a sibling cgroup namespace (that is, a namespace rooted at a
+ different cgroup), the cgroup path relative to its own cgroup namespace
+ root will be shown. For instance, if PID 7353's cgroup namespace root is
+ at '/batchjobs/container_id2', then it will see
+
+ # cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2/sub_cgrp_1
+
+ Note that the relative path always starts with '/' to indicate that it is
+ relative to the cgroup namespace root of the caller.
+
+(4) Processes inside a cgroup namespace can move into and out of the namespace
+ root if they have proper access to external cgroups. So from inside a
+ namespace with cgroupns root at /batchjobs/container_id1, and
+ assuming that the global hierarchy is still accessible inside cgroupns:
+
+ # cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/sub_cgrp_1
+ # echo 7353 > batchjobs/container_id2/cgroup.procs
+ # cat /proc/7353/cgroup
+ 0:cpuset,cpu,cpuacct,memory,devices,freezer,hugetlb:/../container_id2
+
+ Note that this kind of setup is not encouraged. A task inside a cgroup
+ namespace should only be exposed to its own cgroupns hierarchy. Otherwise
+ it makes the virtualization of "/proc/$PID/cgroup" less useful.
+
+(5) Setns to another cgroup namespace is allowed when:
+ (a) the process has CAP_SYS_ADMIN against its current user namespace, and
+ (b) the process has CAP_SYS_ADMIN against the target cgroup namespace's
+ userns.
+ No implicit cgroup changes happen when attaching to another cgroup
+ namespace. It is expected that someone moves the attaching process under
+ the target cgroup namespace root.
+
+(6) When some thread from a multi-threaded process unshares its
+ cgroup namespace, the new cgroupns gets applied to the entire process (all
+ the threads). For the unified hierarchy this is expected as it only allows
+ process-level containerization. For the legacy hierarchies this may be
+ unexpected. So all the threads in the process will have the same cgroup.
+
+(7) The cgroup namespace is alive as long as there is at least 1
+ process inside it. When the last process exits, the cgroup
+ namespace is destroyed. The cgroupns root and the actual cgroups
+ remain.
+
+(8) A namespace-specific cgroup hierarchy can be mounted by a process
+ running inside a non-init cgroup namespace:
+
+ # mount -t cgroup -o __DEVEL__sane_behavior cgroup $MOUNT_POINT
+
+ This will mount the unified cgroup hierarchy with the cgroupns root as the
+ filesystem root. The process needs CAP_SYS_ADMIN against its user and
+ mount namespaces.
+
P. Information on Kernel Programming

This section contains kernel programming information in the areas
--
1.7.9.5
