Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)
From: Serge E. Hallyn
Date: Fri Apr 16 2021 - 17:34:57 EST
On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote:
> On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> > (Eric - this patch (v3) is a cleaned up version of the previous approach.
> > v4 is at https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > and is the approach you suggested. I can send it also as a separate patch
> > if you like)
> >
> > A process running as uid 0 but without cap_setfcap currently can simply
> > unshare a new user namespace with uid 0 mapped to 0. While this task
> > will not have new capabilities against the parent namespace, there is
> > a loophole due to the way namespaced file capabilities work. File
> > capabilities valid in userns 1 are distinguised from file capabilities
> > valid in userns 2 by the kuid which underlies uid 0. Therefore
> > the restricted root process can unshare a new self-mapping namespace,
> > add a namespaced file capability onto a file, then use that file
> > capability in the parent namespace.
> >
> > To prevent that, do not allow mapping uid 0 if the process which
> > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > for setting file capabilities.
> >
> > A further wrinkle: a task can unshare its user namespace, then
> > open its uid_map file itself, and map (only) its own uid. In this
> > case we do not have the credential from before unshare, which was
> > potentially more restricted. So, when creating a user namespace, we
> > record whether the creator had CAP_SETFCAP. Then we can use that
> > during map_write().
> >
> > With this patch:
> >
> > 1. unprivileged user can still unshare -Ur
> >
> > ubuntu@caps:~$ unshare -Ur
> > root@caps:~# logout
> >
> > 2. root user can still unshare -Ur
> >
> > ubuntu@caps:~$ sudo bash
> > root@caps:/home/ubuntu# unshare -Ur
> > root@caps:/home/ubuntu# logout
> >
> > 3. root user without CAP_SETFCAP cannot unshare -Ur:
> >
> > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > root@caps:/home/ubuntu# unshare -Ur
> > unshare: write failed /proc/self/uid_map: Operation not permitted
> >
> > Signed-off-by: Serge Hallyn <serge@xxxxxxxxxx>
> >
> > Changelog:
> > * fix logic in the case of writing to another task's uid_map
> > * rename 'ns' to 'map_ns', and make a file_ns local variable
> > * use /* comments */
> > * update the CAP_SETFCAP comment in capability.h
> > * rename parent_unpriv to parent_can_setfcap (and reverse the
> > logic)
> > * remove printks
> > * clarify (i hope) the code comments
> > * update capability.h comment
> > * renamed parent_can_setfcap to parent_could_setfcap
> > * made the check its own disallowed_0_mapping() fn
> > * moved the check into new_idmap_permitted
> > ---
>
> Thank you for working on this fix!
>
> I do prefer your approach of doing the check at user namespace creation
> time instead of moving it into the setxattr() codepath.
>
> Let me reiterate that the ability to write through fscaps is a valid
> usecase and this should continue to work but that for locked down user
> namespace as Andrew wants to use them your patch provides a clean
> solution.
> We've are using identity mappings in quite a few scenarios partially
> when performing tests but also to write through fscaps.
> We also had reports of users that use identity mappings. They create
> their rootfs by running image extraction in an identity mapped userns
> where fscaps are written through.
> Podman has use-cases for this feature as well and has been affected by
> the regression of the first fix.
Thanks for reviewing.
I'm not sure what your point above is, so just to make sure - the
alternative implementation also does allow fscaps for cases where
root uid is remapped, only disallowing it if it would violate the
ancestor's lack of cap_setfcap.
> > include/linux/user_namespace.h | 3 ++
> > include/uapi/linux/capability.h | 3 +-
> > kernel/user_namespace.c | 61 +++++++++++++++++++++++++++++++--
> > 3 files changed, 63 insertions(+), 4 deletions(-)
> >
> > diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> > index 64cf8ebdc4ec..f6c5f784be5a 100644
> > --- a/include/linux/user_namespace.h
> > +++ b/include/linux/user_namespace.h
> > @@ -63,6 +63,9 @@ struct user_namespace {
> > kgid_t group;
> > struct ns_common ns;
> > unsigned long flags;
> > + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> > + * in its effective capability set at the child ns creation time. */
> > + bool parent_could_setfcap;
> >
> > #ifdef CONFIG_KEYS
> > /* List of joinable keyrings in this namespace. Modification access of
> > diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> > index c6ca33034147..2ddb4226cd23 100644
> > --- a/include/uapi/linux/capability.h
> > +++ b/include/uapi/linux/capability.h
> > @@ -335,7 +335,8 @@ struct vfs_ns_cap_data {
> >
> > #define CAP_AUDIT_CONTROL 30
> >
> > -/* Set or remove capabilities on files */
> > +/* Set or remove capabilities on files.
> > + Map uid=0 into a child user namespace. */
> >
> > #define CAP_SETFCAP 31
> >
> > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
> > index af612945a4d0..8c75028a9aae 100644
> > --- a/kernel/user_namespace.c
> > +++ b/kernel/user_namespace.c
> > @@ -106,6 +106,7 @@ int create_user_ns(struct cred *new)
> > if (!ns)
> > goto fail_dec;
> >
> > + ns->parent_could_setfcap = cap_raised(new->cap_effective, CAP_SETFCAP);
> > ret = ns_alloc_inum(&ns->ns);
> > if (ret)
> > goto fail_free;
> > @@ -841,6 +842,56 @@ static int sort_idmaps(struct uid_gid_map *map)
> > return 0;
> > }
> >
> > +/*
> > + * If mapping uid 0, then file capabilities created by the new namespace will
> > + * be effective in the parent namespace. Adding file capabilities requires
> > + * CAP_SETFCAP, which the child namespace will have, so creating such a
> > + * mapping requires CAP_SETFCAP in the parent namespace.
> > + */
> > +static bool disallowed_0_mapping(const struct file *file,
> > + struct user_namespace *map_ns,
> > + struct uid_gid_map *new_map)
> > +{
> > + int idx;
> > + bool zeromapping = false;
> > + const struct user_namespace *file_ns = file->f_cred->user_ns;
> > +
> > + for (idx = 0; idx < new_map->nr_extents; idx++) {
>
> I think having that loop is acceptable here since it's only called once
> at map creation time even though the forward array is not yet sorted.
>
> > + struct uid_gid_extent *e;
> > + u32 lower_first;
> > +
> > + if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> > + e = &new_map->extent[idx];
> > + else
> > + e = &new_map->forward[idx];
> > + if (e->lower_first == 0) {
> > + zeromapping = true;
> > + break;
> > + }
> > + }
> > +
> > + if (!zeromapping)
> > + return false;
> > +
> > + if (map_ns == file_ns) {
> > + /* The user unshared first and is writing to
> > + * /proc/self/uid_map. User already has full
> > + * capabilites in the new namespace, so verify
> > + * that the parent has CAP_SETFCAP. */
> > + if (!file_ns->parent_could_setfcap)
> > + return true;
> > + } else {
> > + /* Process p1 is writing to uid_map of p2, who
> > + * is in a child user namespace to p1's. So
> > + * we verify that p1 has CAP_SETFCAP to its
> > + * own namespace */
> > + if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> > + return true;
> > + }
> > +
> > + return false;
> > +}
>
> Maybe we can tweak this a tiny bit to get rid of the "zeromapping"?:
>
> static bool disallowed_0_mapping(const struct file *file,
> struct user_namespace *map_ns,
> struct uid_gid_map *new_map)
> {
> int idx;
> const struct user_namespace *file_ns = file->f_cred->user_ns;
> struct uid_gid_extent *extent0 = NULL;
>
> for (idx = 0; idx < new_map->nr_extents; idx++) {
> u32 lower_first;
>
> if (new_map->nr_extents <= UID_GID_MAP_MAX_BASE_EXTENTS)
> extent0 = &new_map->extent[idx];
> else
> extent0 = &new_map->forward[idx];
> if (extent0->lower_first == 0)
> break;
>
> extent0 = NULL;
> }
>
> if (!extent0)
> return false;
Feels a little less clear to me, but that's probably just me, so I'll
switch it over, thanks.
>
> if (map_ns == file_ns) {
> /*
> * The user unshared first and is writing to
> * /proc/self/uid_map. User already has full
> * capabilites in the new namespace, so verify
> * that the parent has CAP_SETFCAP.
> */
> if (!file_ns->parent_could_setfcap)
> return true;
> } else {
> /*
> * Process p1 is writing to uid_map of p2, who
> * is in a child user namespace to p1's. So
> * we verify that p1 has CAP_SETFCAP to its
> * own namespace.
> */
> if (!file_ns_capable(file, map_ns->parent, CAP_SETFCAP))
> return true;
> }
>
> return false;
> }
>
> In addition I would think that expressing the logic the other way around
> is more legible. I'm not too keen on having negations in function names.
> We should probably also tweak the comment a bit and make it kernel-doc
> clean:
>
> /**
> * verify_root_map() - check the uid 0 mapping
Hm. restrict_root_map() ? "verify" sounds like we should sometimes reject
it.
> * @file: idmapping file
> * @map_ns: user namespace of the target process
> * @new_map: requested idmap
> *
> * If a process requested a mapping for uid 0 onto
> * uid 0 verify that the process writing the map had the CAP_SETFCAP
> * capability as the target process will be able to
> * write fscaps that are valid in ancestor user namespaces.
> *
> * Return: true if the mapping is allow, false if not.
> */
> static bool verify_root_map()