Re: cgroup namespace and user namespace interactions
From: Aleksa Sarai
Date: Fri Apr 29 2016 - 01:54:09 EST
> The new cgroup namespace currently only allows for superficial
> interaction with the user namespace (before allowing mounting and
> similar operations, it checks whether the user has the right
> capabilities in the user namespace that the cgroup namespace was
> created in). However, there is one
> glaring feature that appears to be missing from the new cgroup
> namespace implementation: processes in unprivileged user namespaces
> can't modify their sub-hierarchy. This is particularly frustrating for the
> containerisation community, where we are working on adding support for
> "rootless containers" in runC (the execution driver of Docker)[1]. It
> essentially means that we can't use cgroup resource limiting to limit
> *the resources of our own processes*. It also makes things like the
> freezer cgroup unusable.
>
> Here follows how I think we can solve this issue: the most obvious way
> of dealing with this would be (in the cgroupv1 view) to create a new
> subtree in every controller when you CLONE_NEWCGROUP. This new subtree
> is the root of the process's cgroup hierarchy. This doesn't affect any
> resource control, but it will result in the process only being able to
> affect its *own* resources. However, cgroupv2 has the "No Internal
> Process Constraint", which stops a non-root cgroup from both holding
> processes and enabling controllers for its children. So, maybe we
> could also move all of the other processes into a sibling subtree
> (with the *exact same* access permissions as the parent). Thus, the
> operation would look like this:
>
> - C0 - P00
>      \ P01
>      \ P02 (about to setns)
>
> becomes
>
> - C0 - C00 - P00
>            \ P01
>      \ C01 - P02
>
> But then we have C00 which is just a waste of cycles (it doesn't have
> any resource settings). So maybe there's some optimisation we can do
> there, but that's as far as I've gotten into thinking about how to
> deal with the constraints of cgroupv2. After that's been solved we can
> reuse the user namespace that the cgroup namespace was created in
> (cgroup_namespace.user_ns), and just check that whoever is trying to
> modify the cgroup has CAP_SYS_ADMIN in that user namespace.
>
> Do you think this would work? Are there any recommendations on whether
> we can make this work better? Also, can you clarify whether
> CLONE_NEWCGROUP only works for cgroupv2, or whether it also works on
> cgroupv1? (We haven't yet transitioned to cgroupv2 in runC.)
>
> Thanks.
>
> [1]: https://github.com/opencontainers/runc/pull/774
Does anyone have an opinion on this proposal?
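
To make the proposed operation easier to discuss, here is a rough
userspace model of it in Python. It is only a sketch, not kernel code:
the names Cgroup, UserNS, Task, unshare_cgroupns and may_modify are made
up for illustration, owner_ns stands in for cgroup_namespace.user_ns, and
the capability check is simplified (a capability held in a user namespace
is assumed to apply to its descendants, and the namespace-owner uid rule
is ignored).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class UserNS:
    """Toy user namespace; the initial namespace has no parent."""
    name: str
    parent: Optional["UserNS"] = None


@dataclass(eq=False)
class Cgroup:
    name: str
    owner_ns: UserNS  # stands in for cgroup_namespace.user_ns
    procs: List[str] = field(default_factory=list)
    children: List["Cgroup"] = field(default_factory=list)


@dataclass(eq=False)
class Task:
    user_ns: UserNS
    caps: frozenset = frozenset()  # capabilities held in user_ns


def unshare_cgroupns(cg: Cgroup, proc: str, new_ns: UserNS) -> Cgroup:
    """Split cg as in the diagram and return the new namespace root."""
    if proc not in cg.procs:
        raise ValueError(f"{proc} is not in {cg.name}")
    others = [p for p in cg.procs if p != proc]
    cg.procs = []  # cg keeps no processes of its own after the split
    if others:
        # Sibling subtree (C00): same owner as the parent, no new settings.
        cg.children.append(
            Cgroup(f"{cg.name}{len(cg.children)}", cg.owner_ns, others))
    # New root (C01) for the process that unshared its cgroup namespace.
    root = Cgroup(f"{cg.name}{len(cg.children)}", new_ns, [proc])
    cg.children.append(root)
    return root


def may_modify(task: Task, cg: Cgroup) -> bool:
    """True if task holds CAP_SYS_ADMIN in cg.owner_ns.

    Simplified: a capability held in a namespace also applies to its
    descendants; the namespace-owner uid rule is ignored.
    """
    ns: Optional[UserNS] = cg.owner_ns
    while ns is not None:
        if ns is task.user_ns:
            return "CAP_SYS_ADMIN" in task.caps
        ns = ns.parent
    return False
```

Starting from a Cgroup named "C0" holding P00, P01 and P02, calling
unshare_cgroupns(c0, "P02", child_ns) leaves C0 empty with children C00
(P00, P01) and C01 (P02). A task with CAP_SYS_ADMIN in child_ns then
passes may_modify for C01 but not for C00 or C0, which is the behaviour
the proposal is after.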
--
Aleksa Sarai (cyphar)
www.cyphar.com