Re: [RESEND RFC PATCH 1/1] Selectively allow CAP_SYS_NICE capability inside user namespaces

From: Jann Horn
Date: Mon Nov 18 2019 - 14:31:24 EST


On Mon, Nov 18, 2019 at 6:04 PM Prakash Sangappa
<prakash.sangappa@xxxxxxxxxx> wrote:
> Allow CAP_SYS_NICE to take effect for processes having effective uid of a
> root user from init namespace.
[...]
> @@ -4548,6 +4548,8 @@ int can_nice(const struct task_struct *p, const int nice)
> 	int nice_rlim = nice_to_rlimit(nice);
>
> 	return (nice_rlim <= task_rlimit(p, RLIMIT_NICE) ||
> +		(ns_capable(__task_cred(p)->user_ns, CAP_SYS_NICE) &&
> +		uid_eq(current_euid(), GLOBAL_ROOT_UID)) ||
> 		capable(CAP_SYS_NICE));

I very strongly dislike tying such a feature to GLOBAL_ROOT_UID.
Wouldn't it be better to control this through procfs, similar to
uid_map and gid_map? If you really need an escape hatch to become
privileged outside a user namespace, then I'd much prefer a file
"cap_map" that lets someone with appropriate capabilities in the outer
namespace write a bitmask of capabilities that should have effect
outside the container, or something like that. And limit that to bits
where that's sane, like CAP_SYS_NICE.
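
To make the "cap_map" idea a bit more concrete, here is a rough userspace
model of the check I have in mind. This is only a sketch and not kernel
code: no such file exists today, and struct model_userns, cap_map_write()
and cap_effective_outside() are made-up names. I'm also assuming that
"appropriate capabilities" means the writer must itself hold every bit it
passes through; that rule is open to discussion.

```c
/* Userspace model of the proposed "cap_map"; all names are hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Numeric value of CAP_SYS_NICE in <linux/capability.h>. */
#define CAP_SYS_NICE_NR 23

/* Allowlist: only bits where passing them through is sane. */
#define CAP_MAP_ALLOWED (UINT64_C(1) << CAP_SYS_NICE_NR)

struct model_userns {
	/* Capabilities that should have effect outside the namespace. */
	uint64_t cap_map;
};

/*
 * Models a write to cap_map by a task in the outer namespace.
 * writer_caps is the writer's effective capability set out there.
 * Returns 0 on success, -1 if the write is refused.
 */
static int cap_map_write(struct model_userns *ns, uint64_t writer_caps,
			 uint64_t mask)
{
	if (mask & ~CAP_MAP_ALLOWED)
		return -1;	/* bit is not on the allowlist */
	if (mask & ~writer_caps)
		return -1;	/* assumption: can't grant what you don't hold */
	ns->cap_map = mask;
	return 0;
}

/*
 * True if a task holding task_caps inside the namespace may exercise
 * cap against objects outside of it.
 */
static bool cap_effective_outside(const struct model_userns *ns,
				  uint64_t task_caps, int cap)
{
	uint64_t bit;

	assert(cap >= 0 && cap < 64);
	bit = UINT64_C(1) << cap;
	return (task_caps & bit) && (ns->cap_map & bit);
}
```

With something like this, can_nice() could ask whether CAP_SYS_NICE is
passed through for the caller's namespace instead of looking at the
caller's euid at all.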

If we tie features like this to GLOBAL_ROOT_UID, more people are going
to run their containers with GLOBAL_ROOT_UID. Which is a terrible,
terrible idea. GLOBAL_ROOT_UID gives you privilege over all sorts of
files that you shouldn't be able to access, and only things like mount
namespaces and possibly LSMs prevent you from exercising that
privilege. GLOBAL_ROOT_UID should only ever be given to processes that
you trust completely.