Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

From: Andy Lutomirski
Date: Tue Oct 21 2014 - 20:59:18 EST

Next message: Namhyung Kim: "Re: [PATCH 1/8] perf, tools: Support handling complete branch stacks as histograms"
Previous message: Dave Jones: "Re: [3.18rc1] btrfs triggering vm bug_on"
In reply to: Aditya Kali: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Next in thread: Aditya Kali: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, Oct 21, 2014 at 5:46 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote:
> On Tue, Oct 21, 2014 at 3:42 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Tue, Oct 21, 2014 at 3:33 PM, Aditya Kali <adityakali@xxxxxxxxxx> wrote:
>>> On Tue, Oct 21, 2014 at 12:02 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>> On Tue, Oct 21, 2014 at 11:49 AM, Aditya Kali <adityakali@xxxxxxxxxx> wrote:
>>>>> On Mon, Oct 20, 2014 at 10:49 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>>>>> On Mon, Oct 20, 2014 at 10:42 PM, Eric W. Biederman
>>>>>> <ebiederm@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> I do wonder if we think of this as chcgrouproot if there is a simpler
>>>>>>> implementation.
>>>>>>
>>>>>> Could be. I'll defer to Aditya for that one.
>>>>>>
>>>>>
>>>>> More than chcgrouproot, its probably closer to pivot_cgroup_root. In
>>>>> addition to restricting the process to a cgroup-root, new processes
>>>>> entering the container should also be implicitly contained within the
>>>>> cgroup-root of that container.
>>>>
>>>> Why? Concretely, why should this be in the kernel namespace code
>>>> instead of in userspace?
>>>>
>>>
>>> Userspace can do it too. Though then there will be possibility of
>>> having processes in the same mount namespace with different
>>> cgroup-roots. Deriving contents of /proc/<pid>/cgroup becomes even
>>> more complex. Thats another reason why it might not be good idea to
>>> tie cgroups with mount namespace.
>>>
>>>>> Implementing pivot_cgroup_root would
>>>>> probably involve overloading mount-namespace to now understand cgroup
>>>>> filesystem too. I did attempt combining cgroupns-root with mntns
>>>>> earlier (not via a new syscall though), but came to the conclusion
>>>>> that its just simpler to have a separate cgroup namespace and get
>>>>> clear semantics. One of the issues was that implicitly changing cgroup
>>>>> on setns to mntns seemed like a huge undesirable side-effect.
>>>>>
>>>>> About pinning: I really feel that it should be OK to pin processes
>>>>> within cgroupns-root. I think thats one of the most important feature
>>>>> of cgroup-namespace since its most common usecase is to containerize
>>>>> un-trusted processes - processes that, for their entire lifetime, need
>>>>> to remain inside their container.
>>>>
>>>> So don't let them out. None of the other namespaces have this kind of
>>>> constraint:
>>>>
>>>> - If you're in a mntns, you can still use fds from outside.
>>>> - If you're in a netns, you can still use sockets from outside the namespace.
>>>> - If you're in an ipcns, you can still use ipc handles from outside.
>>>
>>> But none of the namespaces allow you to allocate new fds/sockets/ipc
>>> handles in the outside namespace. I think moving a process outside of
>>> cgroupns-root is like allocating a resource outside of your namespace.
>>
>> In a pidns, you can see outside tasks if you have an outside procfs
>> mounted, but, if you don't, then you can't. Wouldn't cgroupns be just
>> like that? You wouldn't be able to escape your cgroup as long as you
>> don't have an inappropriate cgroupfs mounted.
>>
>
> I am not if we should only depend on restricted visibility for this
> though. More details below.
>
>>
>>>>
>>>>> And with explicit permission from
>>>>> cgroup subsystem (something like cgroup.may_unshare as you had
>>>>> suggested previously), we can make sure that unprivileged processes
>>>>> cannot pin themselves. Also, maintaining this invariant (your current
>>>>> cgroup is always under your cgroupns-root) keeps the code and the
>>>>> semantics simple.
>>>>
>>>> I actually think it makes the semantics more complex. The less policy
>>>> you stick in the kernel, the easier it is to understand the impact of
>>>> that policy.
>>>>
>>>
>>> My inclination is towards keeping things simpler - both in code as
>>> well as in configuration. I agree that cgroupns might seem
>>> "less-flexible", but in its current form, it encourages consistent
>>> container configuration. If you have a process that needs to move
>>> around between cgroups belonging to different containers, then that
>>> process should probably not be inside any container's cgroup
>>> namespace. Allowing that will just make the cgroup namespace
>>> pretty-much meaningless.
>>
>> The problem with pinning is that preventing it causes problems
>> (specifically, either something potentially complex and incompatible
>> needs to be added or unprivileged processes will be able to pin
>> themselves).
>>
>> Unless I'm missing something, a normal cgroupns user doesn't actually
>> need kernel pinning support to effectively constrain its members'
>> cgroups.
>>
>
> So there are 2 scenarios to consider:
>
> We have 2 containers with cgroups: /container1 and /container2
> Assume process P is running under cgroupns-root '/container1'
>
> (1) process P wants to 'write' to cgroup.procs outside its
> cgroupns-root (say to /container2/cgroup.procs)

This, at least, doesn't have the problem with unprivileged processes
pinning themselves.

> (2) An admin process running in init_cgroup_ns (or any parent cgroupns
> with cgroupns-root above /container1) wants to write pid of process P
> to /container2/cgroup.procs (which lies outside of P's cgroupns-root)
>
> For (1), I think its ok to reject such a write. This is consistent
> with the restriction in cgroup_file_write added in 'Patch 6' of this
> set. I believe this should be independent of visibility of the cgroup
> hierarchy for P.
>
> For (2), we may allow the write to succeed if we make sure that the
> process doing the write is an admin process (with CAP_SYS_ADMIN in its
> userns AND over P's cgroupns->user_ns).

Why is its userns relevant?

Why not just check whether the target cgroup is in the process doing
the write's cgroupns? (NB: you need to check f_cred, here, not
current_cred(), but that's orthogonal.) Then the policy becomes: no
user of cgroupfs can move any process outside of the cgroupfs's user's
cgroupns root.

I think I'm okay with this.

> If this write succeeds, then:
> (a) process P's /proc/<pid>/cgroup does not show anything when viewed
> by 'self' or any other process in P's cgrgroupns. I would really like
> to avoid showing relative paths or paths outside the cgroupns-root

The empty string seems just as problematic to me.

> (b) if process P does 'mount -t cgroup cgroup <mnt>', it will still be
> only able to mount and see cgroup hierarchy under its cgroupns-root
> (d) if process P tries to write to any cgroup file outside of its
> cgroupns-root (assuming that hierarchy is visible to it for whatever
> reason), it will fail as in (1)

I'm still unconvinced that this serves any purpose. If you give
DAC/MAC permission to a task to write to something, and you give it
access to an fd or mount pointing there, and you don't want it writing
there, then *don't do that*. I'm not really seeing why cgroupns needs
special treatment here.

>
> i.e., in summary, you can't escape out of cgroupns-root yourself. You
> will need help from an admin process running under some parent
> cgroupns-root to move you out. Is that workable for your usecase? Most
> of the things above already happen with the current patch-set, so it
> should be easy to enable this.
>
> Though there are still some open issues like:
> * what happens if you move all the processes out of /container1 and
> then 'rmdir /container1'? As it is now, you won't be able to setns()
> to that cgroupns anymore. But the cgroupns will still hang around
> until the processes switch their cgroupns.

Seems okay.

> * should we then also allow setns() without first entering the
> cgroupns-root? setns also checks the same conditions as in (a) plus it
> checks that your current cgroup is descendant of target cgroupns-root.
> Alternatively we can special-case setns() to own cgroupns so that it
> doesn't fail.

I think setns should completely ignore the caller's cgroup and should
not change it. Userspace can do this.

> * migration for these processes will be tricky, if not impossible. But
> the admin trying to do this probably doesn't care about it or will
> provision for it.

Migration for processes in a mntns that have a current directory
outside their mntns is also difficult or impossible. Same with
pidnses with an fd pointing at /proc/self from an outside-the-pid-ns
procfs. Nothing new here.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Namhyung Kim: "Re: [PATCH 1/8] perf, tools: Support handling complete branch stacks as histograms"
Previous message: Dave Jones: "Re: [3.18rc1] btrfs triggering vm bug_on"
In reply to: Aditya Kali: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Next in thread: Aditya Kali: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]