Re: [PATCH v2 1/4] capability: add ns_capable_cred()

From: Christian Brauner
Date: Sat May 02 2020 - 11:13:47 EST


On Sat, May 02, 2020 at 09:52:03AM -0500, Eric W. Biederman wrote:
> Christian Brauner <christian.brauner@xxxxxxxxxx> writes:
>
> > On Sat, May 02, 2020 at 07:35:53AM -0500, Eric W. Biederman wrote:
> >> Christian Brauner <christian.brauner@xxxxxxxxxx> writes:
> >>
> >> > On Thu, Apr 30, 2020 at 01:09:30PM -0500, Eric W. Biederman wrote:
> >> >> Christian Brauner <christian.brauner@xxxxxxxxxx> writes:
> >> >>
> >> >> > Add a simple capability helper which makes it possible to determine
> >> >> > whether a set of creds is ns capable wrt to the passed in credentials.
> >> >> > This is not something exciting it's just a more pleasant wrapper around
> >> >> > security_capable() by allowing ns_capable_common() to ake a const struct
> >> >> > cred argument. In ptrace_has_cap() for example, we're using
> >> >> > security_capable() directly. ns_capable_cred() will be used in the next
> >> >> > patch to check against the target credentials the caller is going to
> >> >> > switch to.
> >> >>
> >> >> Given that this is to suppot setns. I don't understand the
> >> >> justification for this.
> >> >>
> >> >> Is it your intention to use the reduced permissions that you get
> >> >> when you install a user namespace?
> >> >
> >> > Indeed.
> >> >
> >> >>
> >> >> Why do you want to use the reduced permissions when installing multiple
> >> >> namespaces at once?
> >> >
> >> > The intention is to use the target credentials we are going to install
> >> > when setns() hits the point of no return. The target permissions are
> >> > either the permissions of the caller or the reduced permissions if
> >> > CLONE_NEWUSER has been requested. This has multiple reasons.
> >> >
> >> > The most obvious reason imho is that all other namespaces have an owning
> >> > user namespace. Attaching to any other namespace requires the attacher
> >> > to be privileged over the owning user namespace of that namespace.
> >> > Consequently, all our current install handlers for every namespace we
> >> > have check whether we are privileged in the owning user namespace of
> >> > that user namespace. So in order to attach to any of those namespaces -
> >> > especially when attaching as an unprivileged user - requires that we are
> >> > attached to the user namespace first.
> >>
> >> No actually it doesn't. As if you have privileges to attach to the user
> >> namespace you have the privileges to attach to anything it owns. Or you
> >> should I think I am missing some subtle detail at the moment.
> >
> > Sorry, I phrased that very misleadingly. What I'm talking about should
> > be obvious in the second patch:
> >
> > -static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
> > +static int utsns_install(struct nsset *nsset, struct ns_common *new)
> > {
> > + struct nsproxy *nsproxy = nsset->nsproxy;
> > struct uts_namespace *ns = to_uts_ns(new);
> >
> > - if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > - !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > + if (!ns_capable_cred(nsset->cred, ns->user_ns, CAP_SYS_ADMIN) ||
> > + !ns_capable_cred(nsset->cred, nsset->cred->user_ns, CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > All our current install handlers check that we're privileged wrt to our
> > _current_ user namespace, i.e. they all have a
> > ns_capable(current_user_ns(), CAP_SYS_ADMIN)
> > line. So if we specify
> > setns(CLONE_NEWUSER | CLONE_NEW*)
> > as an unprivileged user we aren't able to attach to any other namespace
> > unless we check against the target credentials.
> >
> >>
> >>
> >> > (That's especially useful given
> >> > that most users especially container runtimes will unshare all
> >> > namespaces. Doing it this way we can not just have attach privileged
> >> > users attach to their containers but also unprivileged users to their
> >> > containers in one shot.)
> >>
> >> That is a wonderful reason for doing things, and it is the reason
> >> why I am asking about it because I think you have it backwards.
> >>
> >> Especially in the context of some container runtimes like Kubernetes
> >> that I have been told will do things like share a network namespace
> >> across all containers in a POD.
> >
> > Kubernetes doesn't use user namespace at all so they need to run as root
> > anyway so that's not a problem. And if we're talking about scenarios
> > where some namespaces are shared and other's aren't then there'll be
> > batch-attaching any, i.e. not all at once but some at once. So I don't
> > think this is a good argument.
> >
> >>
> >> > A few other points about this. If one looks at clone(CLONE_NEW*) or
> >> > unshare(CLONE_NEW*) then the ordering and permissions checking is the
> >> > same way. All permissions checks are performed against the reduced
> >> > permissions, i.e. if CLONE_NEWUSER is specified you check privilege
> >> > against the reduced permissions too otherwise you wouldn't be able to
> >> > spawn into a complete set of new namespaces as an unprivileged user.
> >>
> >> That is a good catch and definitely a reason for looking at doing
> >> things in this order.
> >>
> >> For unshare and clone putting things in a user namespace means you can
> >> create namespaces you could not create otherwise.
> >>
> >>
> >> > This logic is also expressed in how setns() is already used in
> >> > userspace. Any container runtime will attach to the user namespace first,
> >> > so all subsequent calls to attach to other namespaces perform the checks
> >> > against the reduced permissions. It also has to be that way because of
> >> > fully unprivileged containers.
> >>
> >> So I sat and looked. For nsetner it winds up trying to enter
> >> the namespaces in either order.
> >>
> >> /*
> >> * Now that we know which namespaces we want to enter, enter
> >> * them. Do this in two passes, not entering the user
> >> * namespace on the first pass. So if we're deprivileging the
> >> * container we'll enter the user namespace last and if we're
> >> * privileging it then we enter the user namespace first
> >> * (because the initial setns will fail).
> >> */
> >> for (pass = 0; pass < 2; pass ++) {
> >> for (nsfile = namespace_files + 1 - pass; nsfile->nstype; nsfile++) {
> >> if (nsfile->fd < 0)
> >> continue;
> >> if (nsfile->nstype == CLONE_NEWPID && do_fork == -1)
> >> do_fork = 1;
> >> if (setns(nsfile->fd, nsfile->nstype)) {
> >> if (pass != 0)
> >> err(EXIT_FAILURE,
> >> _("reassociate to namespace '%s' failed"),
> >> nsfile->name);
> >> else
> >> continue;
> >> }
> >>
> >> close(nsfile->fd);
> >> nsfile->fd = -1;
> >> }
> >> }
> >>
> >>
> >> Looking a little close we have at least for entering the mntns the
> >> following checks:
> >>
> >> if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> >> !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> >> !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> >> return -EPERM;
> >>
> >> Which require CAP_SYS_CHROOT and CAP_SYS_ADMIN in the current user namespace.
> >>
> >> So there is defintiely an issue there.
> >>
> >> I suspect the clean approach is to simply require CAP_SYS_CHROOT in
> >> the new user namespace, when we are setting several of them at once.
> >
> > See my example above. All install handlers check for
> > ns_capable(current_user_ns(), CAP_SYS_ADMIN).
> >
> >
> > static int ipcns_install(struct nsproxy *nsproxy, struct ns_common *new)
> > {
> > struct ipc_namespace *ns = to_ipc_ns(new);
> > if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > return -EPERM;
> >
> >
> > static int netns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > struct net *net = to_net_ns(ns);
> >
> > if (!ns_capable(net->user_ns, CAP_SYS_ADMIN) ||
> > !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > static int mntns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > struct fs_struct *fs = current->fs;
> > struct mnt_namespace *mnt_ns = to_mnt_ns(ns), *old_mnt_ns;
> > struct path root;
> > int err;
> >
> > if (!ns_capable(mnt_ns->user_ns, CAP_SYS_ADMIN) ||
> > !ns_capable(current_user_ns(), CAP_SYS_CHROOT) ||
> > !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > static int cgroupns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > struct cgroup_namespace *cgroup_ns = to_cg_ns(ns);
> >
> > if (!ns_capable(current_user_ns(), CAP_SYS_ADMIN) ||
> > !ns_capable(cgroup_ns->user_ns, CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > static int utsns_install(struct nsproxy *nsproxy, struct ns_common *new)
> > {
> > struct uts_namespace *ns = to_uts_ns(new);
> >
> > if (!ns_capable(ns->user_ns, CAP_SYS_ADMIN) ||
> > !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > return -EPERM;
> >
> > static int pidns_install(struct nsproxy *nsproxy, struct ns_common *ns)
> > {
> > struct pid_namespace *active = task_active_pid_ns(current);
> > struct pid_namespace *ancestor, *new = to_pid_ns(ns);
> >
> > if (!ns_capable(new->user_ns, CAP_SYS_ADMIN) ||
> > !ns_capable(current_user_ns(), CAP_SYS_ADMIN))
> > return -EPERM;
> >
> >>
> >> > To put it another way, if we were to always perform the permission checks
> >> > against the current permissions (i.e. no matter if CLONE_NEWUSER is
> >> > specified or not) then we'd never be able to attach to a set of
> >> > namespaces at once as an unprivileged user.
> >> > We also really want to be able to express both semantics:
> >> > 1. setns(flags & ~CLONE_NEWUSER) --> attach to all namespaces with my
> >> > current permission level
> >> > 2. setns(flags | CLONE_NEWUSER) attach to all namespaces with the target
> >> > permission level
> >> > It feels weird if both 1 and 2 would mean the exact same thing given
> >> > that the user namespace has an owernship relation with all the other
> >> > namespaces.
> >>
> >> It feels weird to me to disallow anything that we have permissions for.
> >>
> >> Can you dig up the actual install permissions checks and see if there is
> >> anything other than the mount namespace that needs permissions in the
> >> current user namespace?
> >
> > Yep did, all of them check ns_capable(current_user_ns(), CAP_SYS_ADMIN),
> > see above please.
> >
> >>
> >> Please let's walk this through. I think there should be a way to
> >> carefully phrase the permission checks that we don't need
> >> ns_capable_cred that will allow goofy cases like setns into Kuberneties
> >> PODs that share network namespaces.
> >
> > Hm, Kubernetes doesn't use user namespace. I think I misunderstand your
> > concern maybe. But see below for a suggestion.
> >
> >>
> >> I believe that will be a way to phrase the permission checks so that
> >> with or without CLONE_NEWUSER they make sense, and give very similar
> >> results.
> >>
> >> Certainly attaching to a mount namespace is going to need either being
> >> root or attaching to a user namespace at the same time. Because
> >> attaching to a mount namespace needs functionality that the user
> >> namespace provides.
> >
> > So how about we add a new flag to setns()
> > SETNS_NEWUSER_CRED that is only valid in
> > conjunction with CLONE_NEWUSER and which allows callers to tell the
> > kernel "assume the target credentials first". This way we can support
> > both cases and users can specify what they want and we can clearly
> > document it on the manpages.
>
> Let's not get complicated. Let's make this very simple.

I'd call it "flexible". :)

>
> Change "ns_capable(current_user_user_ns(), CAP_SYS_ADMIN)"
> to "ns_capable(nsset->cred->user_ns, CAP_SYS_ADMIN)".
>
> We still use the current credentials, but we require the caller of setns
> to have CAP_SYS_ADMIN in the user namespace you are going to wind up in.
> That is exactly what userns_install does today.
>
> That won't be any kind of security hole because we still require
> "ns_capable(ns->user_ns, CAP_SYS_ADMIN)"
> on all of the namespaces as well.
>
> That way if you are sufficiently privileged you can still handle joining
> namespaces that are not owned by the owner of the processes user
> namespace. So we can join weird processes.
>
> Does that make sense?

Well, it's funny that you say that since I had a version of that
patchset and I still have it at setns_pidfd_v2_wip_v5. I justed pushed
it to
https://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git/log/?h=setns_pidfd_v2_wip_v5
The reason I refrained from doing this was fear of introducing a
security hole but since we agree that this would be fine let's give it a
go.
In any case, I think I never ran the test-suite I added on that version.
Let me plug this in and test this.

Christian