Re: [PATCH] cgroup: Add new capability to allow a process to migrate other tasks between cgroups

From: John Stultz
Date: Tue Oct 18 2016 - 12:54:54 EST


On Tue, Oct 18, 2016 at 1:17 AM, Michael Kerrisk (man-pages)
<mtk.manpages@xxxxxxxxx> wrote:
> Hi John,
>
> On 18 October 2016 at 01:35, John Stultz <john.stultz@xxxxxxxxxx> wrote:
>> On Mon, Oct 17, 2016 at 3:40 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Mon, Oct 17, 2016 at 3:35 PM, John Stultz <john.stultz@xxxxxxxxxx> wrote:
>>>> This patch adds CAP_GROUP_MIGRATE and logic to allows a process
>>>> to migrate other tasks between cgroups.
>>>>
>>>> In Android (where this feature originated), the ActivityManager tracks
>>>> various application states (TOP_APP, FOREGROUND, BACKGROUND, SYSTEM,
>>>> etc), and then as applications change states, the SchedPolicy logic
>>>> will migrate the application tasks between different cgroups used
>>>> to control the different application states (for example, there is a
>>>> background cpuset cgroup which can limit background tasks to stay
>>>> on one low-power cpu, and the bg_non_interactive cpuctrl cgroup can
>>>> then further limit those background tasks to a small percentage of
>>>> that one cpu's cpu time).
>>>>
>>>> However, for security reasons, Android doesn't want to make the
>>>> system_server (the process that runs the ActivityManager and
>>>> SchedPolicy logic), run as root. So in the Android common.git
>>>> kernel, they have some logic to allow cgroups to loosen their
>>>> permissions so CAP_SYS_NICE tasks can migrate other tasks between
>>>> cgroups.
>>>>
>>>> The approach taken there overloads CAP_SYS_NICE a bit much, and
>>>> is maybe more complicated then needed.
>>>>
>>>> So this patch, as suggested by Tejun, simply adds a new process
>>>> capability flag (CAP_CGROUP_MIGRATE), and uses it when checking
>>>> if a task can migrate other tasks between cgroups.
>>>>
>>>> I've tested this with AOSP master (though its a bit hacked in as I
>>>> still need to properly get the selinux bits aware of the new
>>>> capability bit) with selinux set to permissive and it seems to be
>>>> working well.
>>>>
>>>> Thoughts and feedback would be appreciated!
>>>>
>>>> Cc: Tejun Heo <tj@xxxxxxxxxx>
>>>> Cc: Li Zefan <lizefan@xxxxxxxxxx>
>>>> Cc: Jonathan Corbet <corbet@xxxxxxx>
>>>> Cc: cgroups@xxxxxxxxxxxxxxx
>>>> Cc: Android Kernel Team <kernel-team@xxxxxxxxxxx>
>>>> Cc: Rom Lemarchand <romlem@xxxxxxxxxxx>
>>>> Cc: Colin Cross <ccross@xxxxxxxxxxx>
>>>> Cc: Dmitry Shmidt <dimitrysh@xxxxxxxxxx>
>>>> Cc: Ricky Zhou <rickyz@xxxxxxxxxxxx>
>>>> Cc: Dmitry Torokhov <dmitry.torokhov@xxxxxxxxx>
>>>> Cc: Todd Kjos <tkjos@xxxxxxxxxx>
>>>> Cc: Christian Poetzsch <christian.potzsch@xxxxxxxxxx>
>>>> Cc: Amit Pundir <amit.pundir@xxxxxxxxxx>
>>>> Cc: Serge E. Hallyn <serge@xxxxxxxxxx>
>>>> Cc: linux-api@xxxxxxxxxxxxxxx
>>>> Signed-off-by: John Stultz <john.stultz@xxxxxxxxxx>
>>>> ---
>>>> v2: Renamed to just CAP_CGROUP_MIGRATE as reccomended by Tejun
>>>> ---
>>>> include/uapi/linux/capability.h | 5 ++++-
>>>> kernel/cgroup.c | 3 ++-
>>>> 2 files changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
>>>> index 49bc062..44d7ff4 100644
>>>> --- a/include/uapi/linux/capability.h
>>>> +++ b/include/uapi/linux/capability.h
>>>> @@ -349,8 +349,11 @@ struct vfs_cap_data {
>>>>
>>>> #define CAP_AUDIT_READ 37
>>>>
>>>> +/* Allow migrating tasks between cgroups */
>>>>
>>>> -#define CAP_LAST_CAP CAP_AUDIT_READ
>>>> +#define CAP_CGROUP_MIGRATE 38
>>>> +
>>>> +#define CAP_LAST_CAP CAP_CGROUP_MIGRATE
>>>>
>>>> #define cap_valid(x) ((x) >= 0 && (x) <= CAP_LAST_CAP)
>>>>
>>>> diff --git a/kernel/cgroup.c b/kernel/cgroup.c
>>>> index 85bc9be..09f84d2 100644
>>>> --- a/kernel/cgroup.c
>>>> +++ b/kernel/cgroup.c
>>>> @@ -2856,7 +2856,8 @@ static int cgroup_procs_write_permission(struct task_struct *task,
>>>> */
>>>> if (!uid_eq(cred->euid, GLOBAL_ROOT_UID) &&
>>>> !uid_eq(cred->euid, tcred->uid) &&
>>>> - !uid_eq(cred->euid, tcred->suid))
>>>> + !uid_eq(cred->euid, tcred->suid) &&
>>>> + !ns_capable(tcred->user_ns, CAP_CGROUP_MIGRATE))
>>>> ret = -EACCES;
>>>
>>> This logic seems rather confused to me. Without this patch, a user
>>> can write to procs if it's root *or* it matches the target uid *or* it
>>> matches the target suid. How does this make sense? How about
>>> ptrace_may_access(...) || ns_capable(tcred->user_ns,
>>> CAP_CGROUP_MIGRATE)?
>>
>> Though ptrace_may_access would open it also to apps with
>> CAP_SYS_PTRACE as well, no?
>>
>> Would pulling out from __ptrace_may_access the:
>> if (uid_eq(caller_uid, tcred->euid) &&
>> uid_eq(caller_uid, tcred->suid) &&
>> uid_eq(caller_uid, tcred->uid) &&
>> gid_eq(caller_gid, tcred->egid) &&
>> gid_eq(caller_gid, tcred->sgid) &&
>> gid_eq(caller_gid, tcred->gid))
>> goto ok;
>>
>> check and creating a new helper that could be shared between them be
>> the right approach?
>
> So, is creating a new capability here necessarily the right approach?
> Is this operation so unique, or is there an existing silo (not
> CAP_SYS_ADMIN) that we can re-use? I ask, because we currently use 38
> silos out of a possible 64 capabilities, and when everyone chooses
> single-use capabilities, we will quickly exhaust the silos.

Agreed this is a concern, and CGROUP_MIGRATE is maybe too narrow of a
specification for something so limited.

> I'm not saying that creating a new capability here is wrong, but it is
> worth further considering the existing silos to see if there is one
> that is a suitable match.
>
> Looking at http://man7.org/linux/man-pages/man7/capabilities.7.html
> throws up the following possibilities:
>
> CAP_SYS_NICE

Again, for Android uses, CAP_SYS_NICE would be fine (ideal really),
but I worry that it might be too commonly given in other systems to
allow a task to migrate potential cgroup restrictions in container
focused environments.

> CAP_SYS_PTRACE

For Android, PTRACE requires too much privilege given to the
controlling task, as that would allow the system_server to also be
able to inspect memory of all other tasks, which raises security
concerns. (We already went through this with the
proc/<tid>/timerslack_ns interface, and had to move back to
CAP_SYS_NICE there).


> CAP_SYS_RESOURCE
>
> I'm aware that you said above that use CAP_SYS_NICE overloads that
> capability a bit too much. Maybe it's true, but on the other hand, by
> my count from dome rough grepping of the kernel source, there are a
> total of 14 capable() checks for CAP_SYS_NICE, out of a total of
> around 1256 capable() checks altogether. So, I think this does need to
> be balanced against the limited number of silos.
>
> Also, CAP_SYS_RESOURCE deserves consideration (34 uses in capable()
> checks). I'd say, since cgroups are about resources, so there's
> something of a match there., so it's also worth considering.

I'll try to look into CAP_SYS_RESOURCE.

Colin/Todd: Any objection from the Android side on CAP_SYS_RESOURCE?

(Or we could just create a new 512bit CAP2_ capabilities interface! :P)

thanks
-john