[PATCH] capabilities: Ambient capability set V1

From: Christoph Lameter
Date: Thu Feb 05 2015 - 16:57:10 EST


Ambient caps are something like restricted root privileges.
A process has a set of additional capabilities, and those
are inherited without having to set capabilities on the other
binaries involved. This allows the partial use of root-like
features in a controlled way. It is often useful for user
space device drivers or software that needs increased
privileges for networking or to control its own scheduling.
Ambient caps avoid having to run these with full root
privileges.

Control over this feature is available via a new
prctl option called PR_CAP_AMBIENT. The second argument to prctl
is the capability number and the third the desired state:
0 for off, any other value for on.

Ambient bits are enabled regardless of the inheritance
mask of the target binary. They are only restricted
by the bounding set.

History:

Linux capabilities have suffered from the problem that they are not
inheritable like regular process characteristics under Unix. This
behavior is counterintuitive given how processes are expected to behave
in Unix.

In particular, there has recently been software that controls NICs from user
space and provides IP stack like behavior also in user space (DPDK and RDMA
kernel API based implementations). Those typically need either capabilities
to allow raw network access or have to be run setuid. There is scripting and
LD_PRELOAD etc. involved, and arbitrary binaries may be run from those scripts,
including those setting additional capabilities or requiring root access.

That does not go well with having file capabilities set that would enable
the capabilities. Maybe it would work if one set up capabilities on
all executables, but that would also defeat a secure design since these
binaries may only need those caps in certain situations. Setting the
inheritable flags on everything might also get one there (if it were not
for the issues with LD_PRELOAD, debugging, etc.).

The easy solution is to allow some capabilities to be inherited like setuid
is. We really prefer to use capabilities instead of setuid (we want to
limit what damage someone can do, after all!). Therefore we have been
running a patch like this in production for the last 6 years. At some
point it becomes tedious to run your own custom kernel, so we would like
to have this functionality upstream.

See some of the earlier related discussions on the problems with capability
inheritance:

0. Recent surprise:
https://lkml.org/lkml/2014/1/21/175

1. Attempt to revise caps
http://www.madore.org/~david/linux/newcaps/

2. Problems of passing caps through exec
http://unix.stackexchange.com/questions/128394/passing-capabilities-through-exec

3. Problems of binding to privileged ports
http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l

4. Reviving capabilities
http://lwn.net/Articles/199004/

There does not seem to be an alternative on the horizon. Some involved
in security development under Linux have even stated that they want to
rip out the whole thing and replace it. It's been a couple of years now
and we are still suffering from the capabilities mess. Let us just
fix it. Others have already done similar implementations, such as Nokia
for the N900.


This patch does not change the default behavior, but it allows setting up
a list of capabilities via prctl that will enable regular
Unix inheritance only for the selected group of capabilities.

With that it is then possible to do something trivial like setting
CAP_NET_RAW on an executable that can then allow that capability to
be inherited by others.

Let's have a look at a coding example of a wrapper that enables
a couple of capabilities:

------------------------------ ambient_test.c
/*
* Test program for the ambient capabilities
*
*
* Compile using:
* gcc -o ambient_test ambient_test.c
*
* This program must have the following capabilities to run properly:
* CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
*
* A command to equip this with the right caps is:
*
* setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
*
* To get a shell with additional caps that can be inherited do:
*
* ./ambient_test /bin/bash
*
*/

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <linux/capability.h>

/* Definition to be updated in the user space include files */
#define PR_CAP_AMBIENT 45

int main(int argc, char **argv)
{
	if (argc < 2) {
		fprintf(stderr, "Usage: %s <program> [args...]\n", argv[0]);
		return 1;
	}

	/* Third argument is the desired state: nonzero turns the cap on */
	if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW, 1, 0, 0))
		perror("Cannot set CAP_NET_RAW");

	if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN, 1, 0, 0))
		perror("Cannot set CAP_NET_ADMIN");

	if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE, 1, 0, 0))
		perror("Cannot set CAP_SYS_NICE");

	printf("Ambient_test forking shell\n");
	fflush(stdout);

	/* execv only returns on failure */
	execv(argv[1], argv + 1);
	perror("Cannot exec");

	return 1;
}
-------------------------------- ambient_test.c

This allows the inheritance of CAP_SYS_NICE, CAP_NET_RAW and CAP_NET_ADMIN.
With that, raw device access is possible and real time priorities
can be set from user space. This is a frequently needed set of
privileged operations in HPC and HFT applications. User space
processes need to be able to directly access devices as well as
have full control over scheduling.
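
To check from inside the spawned shell (or any child) that the caps were
inherited, one can read the CapAmb line that this patch adds to
/proc/<pid>/status. What follows is a minimal sketch, assuming a kernel
with this patch applied; parse_cap_line() and read_self_ambient() are
hypothetical helper names, not part of any existing library.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find a line starting with key (e.g. "CapAmb:") in status-style text and
 * parse the hex mask after it. Returns 0 on success, -1 if the key is
 * missing or no hex value follows it on the same line.
 */
int parse_cap_line(const char *text, const char *key,
		   unsigned long long *out)
{
	size_t klen;
	const char *p = text;

	assert(text && key && out);
	klen = strlen(key);

	while (p && *p) {
		if (strncmp(p, key, klen) == 0) {
			const char *v = p + klen;

			/* Skip the separator, but never cross the newline */
			while (*v == ' ' || *v == '\t')
				v++;
			if (!isxdigit((unsigned char)*v))
				return -1;
			*out = strtoull(v, NULL, 16);
			return 0;
		}
		/* Advance to the start of the next line */
		p = strchr(p, '\n');
		if (p)
			p++;
	}
	return -1;
}

/* Read the ambient mask of the calling process from /proc/self/status. */
int read_self_ambient(unsigned long long *out)
{
	static char buf[16384];
	size_t n;
	FILE *f = fopen("/proc/self/status", "r");

	if (!f)
		return -1;
	n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';

	return parse_cap_line(buf, "CapAmb:", out);
}
```

With CAP_NET_ADMIN (12), CAP_NET_RAW (13) and CAP_SYS_NICE (23) raised,
the parsed mask would be 0x803000. On a kernel without the CapAmb line,
read_self_ambient() returns -1.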

Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>

Index: linux/security/commoncap.c
===================================================================
--- linux.orig/security/commoncap.c 2015-02-05 09:53:43.442883383 -0600
+++ linux/security/commoncap.c 2015-02-05 15:40:31.387388142 -0600
@@ -347,14 +347,17 @@ static inline int bprm_caps_from_vfs_cap
*has_cap = true;

CAP_FOR_EACH_U32(i) {
+ __u32 ambient = current_cred()->cap_ambient.cap[i];
__u32 permitted = caps->permitted.cap[i];
__u32 inheritable = caps->inheritable.cap[i];
+ __u32 x = new->cap_bset.cap[i];

/*
- * pP' = (X & fP) | (pI & fI)
+ * pP' = (X & pA) | (X & fP) | (pI & fI)
*/
new->cap_permitted.cap[i] =
- (new->cap_bset.cap[i] & permitted) |
+ (x & ambient) |
+ (x & permitted) |
(new->cap_inheritable.cap[i] & inheritable);

if (permitted & ~new->cap_permitted.cap[i])
@@ -453,9 +456,13 @@ static int get_file_caps(struct linux_bi
if (rc == -EINVAL)
printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
__func__, rc, bprm->filename);
- else if (rc == -ENODATA)
- rc = 0;
- goto out;
+ else if (rc != -ENODATA)
+ goto out;
+ rc = 0;
+ if (cap_isclear(current_cred()->cap_ambient))
+ goto out;
+ /* Make sure that the ambient caps are enabled */
+ *effective = true;
}

rc = bprm_caps_from_vfs_caps(&vcaps, bprm, effective, has_cap);
@@ -577,6 +584,7 @@ skip:
}

new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
+ new->cap_ambient = old->cap_ambient;
return 0;
}

@@ -933,6 +941,23 @@ int cap_task_prctl(int option, unsigned
new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
return commit_creds(new);

+ case PR_CAP_AMBIENT:
+ if (!ns_capable(current_user_ns(), CAP_SETPCAP))
+ return -EPERM;
+
+ if (!cap_valid(arg2))
+ return -EINVAL;
+
+ if (!ns_capable(current_user_ns(), arg2))
+ return -EPERM;
+
+ new = prepare_creds();
+ if (arg3 == 0)
+ cap_lower(new->cap_ambient, arg2);
+ else
+ cap_raise(new->cap_ambient, arg2);
+ return commit_creds(new);
+
default:
/* No functionality available - continue with default */
return -ENOSYS;
Index: linux/include/linux/cred.h
===================================================================
--- linux.orig/include/linux/cred.h 2015-02-05 09:53:43.442883383 -0600
+++ linux/include/linux/cred.h 2015-02-05 09:53:43.438883512 -0600
@@ -122,6 +122,7 @@ struct cred {
kernel_cap_t cap_permitted; /* caps we're permitted */
kernel_cap_t cap_effective; /* caps we can actually use */
kernel_cap_t cap_bset; /* capability bounding set */
+ kernel_cap_t cap_ambient; /* Ambient capability set */
#ifdef CONFIG_KEYS
unsigned char jit_keyring; /* default keyring to attach requested
* keys to */
Index: linux/include/uapi/linux/prctl.h
===================================================================
--- linux.orig/include/uapi/linux/prctl.h 2015-02-05 09:53:43.442883383 -0600
+++ linux/include/uapi/linux/prctl.h 2015-02-05 09:53:43.438883512 -0600
@@ -185,4 +185,7 @@ struct prctl_mm_map {
#define PR_MPX_ENABLE_MANAGEMENT 43
#define PR_MPX_DISABLE_MANAGEMENT 44

+/* Control the ambient capability set */
+#define PR_CAP_AMBIENT 45
+
#endif /* _LINUX_PRCTL_H */
Index: linux/fs/proc/array.c
===================================================================
--- linux.orig/fs/proc/array.c 2014-12-12 10:27:49.304801274 -0600
+++ linux/fs/proc/array.c 2015-02-05 11:04:38.546429870 -0600
@@ -302,7 +302,8 @@ static void render_cap_t(struct seq_file
static inline void task_cap(struct seq_file *m, struct task_struct *p)
{
const struct cred *cred;
- kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
+ kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
+ cap_bset, cap_ambient;

rcu_read_lock();
cred = __task_cred(p);
@@ -310,12 +311,14 @@ static inline void task_cap(struct seq_f
cap_permitted = cred->cap_permitted;
cap_effective = cred->cap_effective;
cap_bset = cred->cap_bset;
+ cap_ambient = cred->cap_ambient;
rcu_read_unlock();

render_cap_t(m, "CapInh:\t", &cap_inheritable);
render_cap_t(m, "CapPrm:\t", &cap_permitted);
render_cap_t(m, "CapEff:\t", &cap_effective);
render_cap_t(m, "CapBnd:\t", &cap_bset);
+ render_cap_t(m, "CapAmb:\t", &cap_ambient);
}

static inline void task_seccomp(struct seq_file *m, struct task_struct *p)