Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support

From: Andy Lutomirski
Date: Mon Oct 20 2014 - 20:21:14 EST

Next message: Eric Rost: "[PATCH] staging: skein: Loadable Module Support"
Previous message: Chris Zhong: "Re: [PATCH 3/4] pinctrl: rockchip: Parse pin groups before calling pinctrl_register()"
In reply to: Eric W.Biederman: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Next in thread: Eric W. Biederman: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, Oct 19, 2014 at 9:55 PM, Eric W.Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
>
> On October 19, 2014 1:26:29 PM CDT, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>On Sat, Oct 18, 2014 at 10:23 PM, Eric W. Biederman
>><ebiederm@xxxxxxxxxxxx> wrote:
>>> "Serge E. Hallyn" <serge@xxxxxxxxxx> writes:
>>>
>>>> Quoting Aditya Kali (adityakali@xxxxxxxxxx):
>>>>> On Thu, Oct 16, 2014 at 2:12 PM, Serge E. Hallyn <serge@xxxxxxxxxx>
>>wrote:
>>>>> > Quoting Aditya Kali (adityakali@xxxxxxxxxx):
>>>>> >> setns on a cgroup namespace is allowed only if
>>>>> >> * task has CAP_SYS_ADMIN in its current user-namespace and
>>>>> >> over the user-namespace associated with target cgroupns.
>>>>> >> * task's current cgroup is descendent of the target
>>cgroupns-root
>>>>> >> cgroup.
>>>>> >
>>>>> > What is the point of this?
>>>>> >
>>>>> > If I'm a user logged into
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope and I start
>>>>> > a container which is in
>>>>> > /lxc/c1/user.slice/user-1000.slice/session-c12.scope/x1
>>>>> > then I will want to be able to enter the container's cgroup.
>>>>> > The container's cgroup root is under my own (satisfying the
>>>>> > below condition0 but my cgroup is not a descendent of the
>>>>> > container's cgroup.
>>>>> >
>>>>> This condition is there because we don't want to do implicit cgroup
>>>>> changes when a process attaches to another cgroupns. cgroupns tries
>>to
>>>>> preserve the invariant that at any point, your current cgroup is
>>>>> always under the cgroupns-root of your cgroup namespace. But in
>>your
>>>>> example, if we allow a process in "session-c12.scope" container to
>>>>> attach to cgroupns root'ed at "session-c12.scope/x1" container
>>>>> (without implicitly moving its cgroup), then this invariant won't
>>>>> hold.
>>>>
>>>> Oh, I see. Guess that should be workable. Thanks.
>>>
>>> Which has me looking at what the rules are for moving through
>>> the cgroup hierarchy.
>>>
>>> As long as we have write access to cgroup.procs and are allowed
>>> to open the file for write, we can move any of our own tasks
>>> into the cgroup. So the cgroup namespace rules don't seem
>>> to be a problem.
>>>
>>> Andy can you please take a look at the permission checks in
>>> __cgroup_procs_write.
>>
>>The actual requirements for calling that function haven't changed,
>>right? IOW, what does this have to do with cgroupns?
>
> Excluding user namespaces the requirements have not changed.
>
> The immediate correlation is that to enter a cgroupns you must first put your process in one of it's cgroups.
>
> So I was examining what it would take to enter the cgroup of cgroupns.
>
>> Is the idea
>>that you want a privileged user wrt a cgroupns's userns to be able to
>>use this? If so:
>>
>>Yes, that current_cred() thing is bogus. (Actually, this is probably
>>exploitable right now if any cgroup.procs inode anywhere on the system
>>lets non-root write.) (Can we have some kernel debugging option that
>>makes any use of current_cred() in write(2) warn?)
>>
>>We really need a weaker version of may_ptrace for this kind of stuff.
>>Maybe the existing may_ptrace stuff is okay, actually. But this is
>>completely missing group checks, cap checks, capabilities wrt the
>>userns, etc.
>>
>>Also, I think that, if this version of the patchset allows non-init
>>userns to unshare cgroupns, then the issue of what permission is
>>needed to lock the cgroup hierarchy like that needs to be addressed,
>>because unshare(CLONE_NEWUSER|CLONE_NEWCGROUP) will effectively pin
>>the calling task with no permission required. Bolting on a fix later
>>will be a mess.
>
> I imagine the pinning would be like the userns.
>
> Ah but there is a potentially serious issue with the pinning.
> With pinning we can make it impossible for root to move us to a different cgroup.
>
> I am not certain how serious that is but it bears thinking about.
> If we don't implement pinning we should be able to implent everything with just filesystem mount options, and no new namespace required.
>
> Sigh.
>
> I am too tired tonight to see the end game in this.

Possible solution:

Ditch the pinning. That is, if you're outside a cgroupns (or you have
a non-ns-confined cgroupfs mounted), then you can move a task in a
cgroupns outside of its root cgroup. If you do this, then the task
thinks its cgroup is something like "../foo" or "../../foo".

While we're at it, consider making setns for a cgroupns *not* change
the caller's cgroup. Is there any reason it really needs to?

Thoughts?

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Eric Rost: "[PATCH] staging: skein: Loadable Module Support"
Previous message: Chris Zhong: "Re: [PATCH 3/4] pinctrl: rockchip: Parse pin groups before calling pinctrl_register()"
In reply to: Eric W.Biederman: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Next in thread: Eric W. Biederman: "Re: [PATCHv1 7/8] cgroup: cgroup namespace setns support"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]