Re: [PATCH] capabilities: Ambient capability set V2

From: Serge E. Hallyn
Date: Sat Feb 28 2015 - 23:45:04 EST


On Thu, Feb 26, 2015 at 04:14:33PM -0600, Christoph Lameter wrote:
>
> V1->V2:
> - Fix up the processing of the caps bits after discussions
> with Andy and Serge. Make patch less intrusive.
>
> Ambient caps are something like restricted root privileges.
> A process has a set of additional capabilities and those
> are inherited without having to set capabilities on the other
> binaries involved. This allows the partial use of root-like
> features in a controlled way. It is often useful
> to do this for user space device drivers or software that
> needs increased privileges for networking or to control
> its own scheduling. Ambient caps allow one to avoid
> having to run these with full root privileges.
>
> Control over this feature is available via a new
> prctl option called PR_CAP_AMBIENT. The second argument to prctl
> is the capability number and the third the desired state:
> 0 for off, anything else for on.
>
> Ambient bits are enabled regardless of the inheritance
> mask of the target binary. They are only restricted
> by the bounding set.
>
> History:
>
> Linux capabilities have suffered from the problem that they are not
> inheritable like regular process characteristics under Unix. This
> behavior is counterintuitive given how processes are expected to
> behave in Unix.
>
> In particular, there has recently been software that controls NICs from user
> space and also provides IP-stack-like behavior in user space (DPDK and RDMA
> kernel API based implementations). Those typically need either capabilities
> to allow raw network access or have to be run setuid. There is scripting and
> LD_PRELOAD etc. involved, and arbitrary binaries may be run from those scripts,
> including ones that set additional capabilities or require root access.
>
> That does not go well with having file capabilities set that would enable
> the capabilities. Maybe it would work if one set up capabilities on
> all executables, but that would also defeat a secure design since these
> binaries may only need those caps in certain situations. OK, setting the
> inheritable flags on everything may also get one there (if it were not
> for the issues with LD_PRELOAD, debugging, etc.).
>
> The easy solution is to allow some capabilities to be inherited like setuid
> is. We really prefer to use capabilities instead of setuid (we want to
> limit what damage someone can do after all!). Therefore we have been
> running a patch like this in production for the last 6 years. At some
> point it becomes tedious to run your own custom kernel so we would like
> to have this functionality upstream.
>
> See some of the earlier related discussions on the problems with capability
> inheritance:
>
> 0. Recent surprise:
> https://lkml.org/lkml/2014/1/21/175
>
> 1. Attempt to revise caps
> http://www.madore.org/~david/linux/newcaps/
>
> 2. Problems of passing caps through exec
> http://unix.stackexchange.com/questions/128394/passing-capabilities-through-exec
>
> 3. Problems of binding to privileged ports
> http://stackoverflow.com/questions/413807/is-there-a-way-for-non-root-processes-to-bind-to-privileged-ports-1024-on-l
>
> 4. Reviving capabilities
> http://lwn.net/Articles/199004/
>
> There does not seem to be an alternative on the horizon. Some involved
> in security development under Linux have even stated that they want to
> rip out the whole thing and replace it. It has been a couple of years now
> and we are still suffering from the capabilities mess. Let us just
> fix it. Others have already done implementations like this, for example
> Nokia for the N900.
>
>
> This patch does not change the default behavior, but it allows setting up
> a list of capabilities via prctl that enables regular
> Unix inheritance only for the selected group of capabilities.
>
> With that it is then possible to do something trivial like setting
> CAP_NET_RAW on an executable that can then allow that capability to
> be inherited by others.
>
> Let's have a look at a code example of a wrapper that enables
> a couple of capabilities:
>
> ------------------------------ ambient_test.c
> /*
> * Test program for the ambient capabilities
> *
> *
> * Compile using:
> * gcc -o ambient_test ambient_test.c
> *
> * This program must have the following capabilities to run properly:
> * CAP_SETPCAP, CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
> *
> * A command to equip this with the right caps is:
> *
> * setcap cap_setpcap,cap_net_raw,cap_net_admin,cap_sys_nice+eip ambient_test
> *
> * To get a shell with additional caps that can be inherited do:
> *
> * ./ambient_test /bin/bash
> *
> */
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <errno.h>
> #include <unistd.h>
> #include <sys/prctl.h>
> #include <linux/capability.h>
>
> /* Definition to be updated in the user space include files */
> #define PR_CAP_AMBIENT 45
>
> int main(int argc, char **argv)
> {
> 	/* Third argument is the desired state: nonzero turns the cap on */
> 	if (prctl(PR_CAP_AMBIENT, CAP_NET_RAW, 1UL))
> 		perror("Cannot set CAP_NET_RAW");
>
> 	if (prctl(PR_CAP_AMBIENT, CAP_NET_ADMIN, 1UL))
> 		perror("Cannot set CAP_NET_ADMIN");
>
> 	if (prctl(PR_CAP_AMBIENT, CAP_SYS_NICE, 1UL))
> 		perror("Cannot set CAP_SYS_NICE");
>

Your example program is not filling in pI though?

Ah, I see why. In get_file_caps() you are still assigning

fP = pA

if the file has no file capabilities. So then you are actually
doing

pP' = (X & (fP | pA)) | (pI & (fI | pA))
rather than
pP' = (X & fP) | (pI & (fI | pA))

Other than that, the patch is looking good to me. We should
consider emitting an audit record when a task fills in its
pA, and I do still wonder whether we should be requiring
CAP_SETFCAP (unsure how best to think of it). But assuming the
fP = pA was not intended, I think this largely does the right
thing.
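
To make the difference between the two rules concrete, here is a minimal
user-space sketch on a single 32-bit word of each set. It is only an
illustration of the formulas as written above, not the kernel code: the
struct, the function names and the locally defined bit number are
assumptions made for the example, and the file is modelled as having no
file capabilities at all (fP = fI = 0).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local constant; CAP_NET_RAW is capability number 13. */
#define EXAMPLE_CAP_NET_RAW_BIT 13

/* One 32-bit word of each set, named as in the formulas above. */
struct exec_caps {
	uint32_t X;   /* bounding set */
	uint32_t fP;  /* file permitted */
	uint32_t fI;  /* file inheritable */
	uint32_t pI;  /* process inheritable */
	uint32_t pA;  /* process ambient */
};

/* pP' = (X & fP) | (pI & (fI | pA)) */
static uint32_t permitted_documented(const struct exec_caps *c)
{
	return (c->X & c->fP) | (c->pI & (c->fI | c->pA));
}

/* pP' = (X & (fP | pA)) | (pI & (fI | pA)), i.e. with the fP = pA fallback */
static uint32_t permitted_with_fp_fallback(const struct exec_caps *c)
{
	return (c->X & (c->fP | c->pA)) | (c->pI & (c->fI | c->pA));
}

/* A task with CAP_NET_RAW in pA, an empty pI, and a file without caps. */
static struct exec_caps no_file_caps_example(void)
{
	struct exec_caps c = {
		.X  = UINT32_MAX,
		.fP = 0,
		.fI = 0,
		.pI = 0,
		.pA = UINT32_C(1) << EXAMPLE_CAP_NET_RAW_BIT,
	};
	return c;
}

/* Self-check: the two rules disagree exactly when pI lacks the ambient bit. */
static void demo_difference(void)
{
	struct exec_caps c = no_file_caps_example();

	assert(permitted_documented(&c) == 0);
	assert(permitted_with_fp_fallback(&c) == c.pA);
}
```

With pI empty, the first rule grants nothing while the fallback grants
CAP_NET_RAW, which is why the example program appears to work without
filling in pI.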

> 	if (argc < 2) {
> 		fprintf(stderr, "Usage: %s <program> [args...]\n", argv[0]);
> 		return 1;
> 	}
>
> 	printf("Ambient_test forking shell\n");
> 	if (execv(argv[1], argv + 1))
> 		perror("Cannot exec");
>
> 	return 0;
> }
> -------------------------------- ambient_test.c
>
> Allows the inheritance of CAP_SYS_NICE, CAP_NET_RAW and CAP_NET_ADMIN.
> With that, raw device access is possible and real time priorities
> can also be set from user space. This is a frequently needed set of
> privileged operations in HPC and HFT applications. User space
> processes need to be able to directly access devices as well as
> have full control over scheduling.
>
> Signed-off-by: Christoph Lameter <cl@xxxxxxxxx>
>
> Index: linux/security/commoncap.c
> ===================================================================
> --- linux.orig/security/commoncap.c 2015-02-25 13:43:06.929973954 -0600
> +++ linux/security/commoncap.c 2015-02-26 16:10:02.347913397 -0600
> @@ -347,15 +347,17 @@ static inline int bprm_caps_from_vfs_cap
> *has_cap = true;
>
> CAP_FOR_EACH_U32(i) {
> + __u32 ambient = current_cred()->cap_ambient.cap[i];
> __u32 permitted = caps->permitted.cap[i];
> __u32 inheritable = caps->inheritable.cap[i];
>
> /*
> - * pP' = (X & fP) | (pI & fI)
> + * pP' = (X & fP) | (pI & (fI | pA))
> */
> new->cap_permitted.cap[i] =
> (new->cap_bset.cap[i] & permitted) |
> - (new->cap_inheritable.cap[i] & inheritable);
> + (new->cap_inheritable.cap[i] &
> + (inheritable | ambient));
>
> if (permitted & ~new->cap_permitted.cap[i])
> /* insufficient to execute correctly */
> @@ -453,8 +455,18 @@ static int get_file_caps(struct linux_bi
> if (rc == -EINVAL)
> printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> __func__, rc, bprm->filename);
> - else if (rc == -ENODATA)
> + else if (rc == -ENODATA) {
> rc = 0;
> + if (!cap_isclear(current_cred()->cap_ambient)) {
> + /*
> + * The ambient caps are permitted for
> + * files that have no caps
> + */
> + bprm->cred->cap_permitted =
> + current_cred()->cap_ambient;
> + *effective = true;
> + }
> + }
> goto out;
> }
>
> @@ -549,9 +561,20 @@ skip:
> new->sgid = new->fsgid = new->egid;
>
> if (effective)
> + /*
> + * pE' = pP' & (fE | pA)
> + *
> + * fE is implicitly all set if effective == true.
> + * Therefore the above reduces to
> + *
> + * pE' = pP'
> + */
> new->cap_effective = new->cap_permitted;
> else
> cap_clear(new->cap_effective);
> +
> + /* pA' = pA */
> + new->cap_ambient = old->cap_ambient;
> bprm->cap_effective = effective;
>
> /*
> @@ -566,7 +589,7 @@ skip:
> * Number 1 above might fail if you don't have a full bset, but I think
> * that is interesting information to audit.
> */
> - if (!cap_isclear(new->cap_effective)) {
> + if (!cap_issubset(new->cap_effective, new->cap_ambient)) {
> if (!cap_issubset(CAP_FULL_SET, new->cap_effective) ||
> !uid_eq(new->euid, root_uid) || !uid_eq(new->uid, root_uid) ||
> issecure(SECURE_NOROOT)) {
> @@ -598,7 +621,7 @@ int cap_bprm_secureexec(struct linux_bin
> if (!uid_eq(cred->uid, root_uid)) {
> if (bprm->cap_effective)
> return 1;
> - if (!cap_isclear(cred->cap_permitted))
> + if (!cap_issubset(cred->cap_permitted, cred->cap_ambient))
> return 1;
> }
>
> @@ -933,6 +956,23 @@ int cap_task_prctl(int option, unsigned
> new->securebits &= ~issecure_mask(SECURE_KEEP_CAPS);
> return commit_creds(new);
>
> + case PR_CAP_AMBIENT:
> + if (!ns_capable(current_user_ns(), CAP_SETPCAP))
> + return -EPERM;
> +
> + if (!cap_valid(arg2))
> + return -EINVAL;
> +
> + if (!ns_capable(current_user_ns(), arg2))
> + return -EPERM;
> +
> + new = prepare_creds();
> + if (arg3 == 0)
> + cap_lower(new->cap_ambient, arg2);
> + else
> + cap_raise(new->cap_ambient, arg2);
> + return commit_creds(new);
> +
> default:
> /* No functionality available - continue with default */
> return -ENOSYS;
> Index: linux/include/linux/cred.h
> ===================================================================
> --- linux.orig/include/linux/cred.h 2015-02-25 13:43:06.929973954 -0600
> +++ linux/include/linux/cred.h 2015-02-25 13:43:06.925972078 -0600
> @@ -122,6 +122,7 @@ struct cred {
> kernel_cap_t cap_permitted; /* caps we're permitted */
> kernel_cap_t cap_effective; /* caps we can actually use */
> kernel_cap_t cap_bset; /* capability bounding set */
> + kernel_cap_t cap_ambient; /* Ambient capability set */
> #ifdef CONFIG_KEYS
> unsigned char jit_keyring; /* default keyring to attach requested
> * keys to */
> Index: linux/include/uapi/linux/prctl.h
> ===================================================================
> --- linux.orig/include/uapi/linux/prctl.h 2015-02-25 13:43:06.929973954 -0600
> +++ linux/include/uapi/linux/prctl.h 2015-02-25 13:43:06.925972078 -0600
> @@ -185,4 +185,7 @@ struct prctl_mm_map {
> #define PR_MPX_ENABLE_MANAGEMENT 43
> #define PR_MPX_DISABLE_MANAGEMENT 44
>
> +/* Control the ambient capability set */
> +#define PR_CAP_AMBIENT 45
> +
> #endif /* _LINUX_PRCTL_H */
> Index: linux/fs/proc/array.c
> ===================================================================
> --- linux.orig/fs/proc/array.c 2015-02-25 13:43:06.929973954 -0600
> +++ linux/fs/proc/array.c 2015-02-25 13:43:06.925972078 -0600
> @@ -302,7 +302,8 @@ static void render_cap_t(struct seq_file
> static inline void task_cap(struct seq_file *m, struct task_struct *p)
> {
> const struct cred *cred;
> - kernel_cap_t cap_inheritable, cap_permitted, cap_effective, cap_bset;
> + kernel_cap_t cap_inheritable, cap_permitted, cap_effective,
> + cap_bset, cap_ambient;
>
> rcu_read_lock();
> cred = __task_cred(p);
> @@ -310,12 +311,14 @@ static inline void task_cap(struct seq_f
> cap_permitted = cred->cap_permitted;
> cap_effective = cred->cap_effective;
> cap_bset = cred->cap_bset;
> + cap_ambient = cred->cap_ambient;
> rcu_read_unlock();
>
> render_cap_t(m, "CapInh:\t", &cap_inheritable);
> render_cap_t(m, "CapPrm:\t", &cap_permitted);
> render_cap_t(m, "CapEff:\t", &cap_effective);
> render_cap_t(m, "CapBnd:\t", &cap_bset);
> + render_cap_t(m, "CapAmb:\t", &cap_ambient);
> }
>
> static inline void task_seccomp(struct seq_file *m, struct task_struct *p)
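
For testing, the CapAmb line that this hunk adds to /proc/<pid>/status can
be checked from user space. The following is a minimal sketch, assuming the
line has the same "Name:<tab>hex mask" layout as the existing CapInh,
CapPrm, CapEff and CapBnd lines printed by render_cap_t(). The helper name
parse_cap_line() is made up for this example; it works on a buffer that
already holds the status text, so reading the file is left to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look for "key" (for example "CapAmb:") at the start of a line in a
 * buffer holding the text of /proc/<pid>/status and parse the hex mask
 * that follows it.
 *
 * Returns 0 and stores the mask on success, -1 if the key is not found
 * or is not followed by a hex number on the same line.
 */
static int parse_cap_line(const char *status, const char *key, uint64_t *mask)
{
	size_t keylen;
	const char *line = status;

	assert(status != NULL && key != NULL && mask != NULL);
	keylen = strlen(key);

	while (line != NULL && *line != '\0') {
		if (strncmp(line, key, keylen) == 0) {
			const char *p = line + keylen;

			/* Skip the tab (or spaces) between key and value. */
			while (*p == ' ' || *p == '\t')
				p++;
			if (!isxdigit((unsigned char)*p))
				return -1;
			*mask = strtoull(p, NULL, 16);
			return 0;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line != NULL)
			line++;
	}
	return -1;
}
```

A shell started through ambient_test with CAP_NET_ADMIN (12) and
CAP_NET_RAW (13) raised would be expected to show both bits in the parsed
mask, in addition to CAP_SYS_NICE (23).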
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/