Re: [PATCH 2/5] kernel: add a helper to get an owning user namespace for a namespace

From: Eric W. Biederman
Date: Sun Jul 24 2016 - 10:43:19 EST


Andrew Vagin <avagin@xxxxxxxxxxxxx> writes:

> On Sun, Jul 24, 2016 at 12:03:49AM -0500, Eric W. Biederman wrote:
>> Andrey Vagin <avagin@xxxxxxxxxx> writes:
>>
>> > Return -EPERM if an owning user namespace is outside of a process
>> > current user namespace.
>> >
>> > diff --git a/kernel/user_namespace.c b/kernel/user_namespace.c
>> > index a5bc78c..6382e5e 100644
>> > --- a/kernel/user_namespace.c
>> > +++ b/kernel/user_namespace.c
>> > @@ -994,6 +994,30 @@ static int userns_install(struct nsproxy *nsproxy, struct ns_common *ns)
>> > return commit_creds(cred);
>> > }
>> >
>> > +struct ns_common *ns_get_owner(struct ns_common *ns)
>> > +{
>> > + const struct cred *cred = current_cred();
>> > + struct user_namespace *user_ns, *p;
>> > +
>> > + user_ns = p = ns->user_ns;
>> > + if (user_ns == NULL) { /* ns is init_user_ns */
>> > + /* Unprivileged user should not know that it's init_user_ns. */
>> > + if (capable(CAP_SYS_ADMIN))
>> > + return ERR_PTR(-ENOENT);
>> > + return ERR_PTR(-EPERM);
>> > + }
>>
>> This permission check is not what I meant to request. This does not
>> handle nested user namespaces.
>
> Here I handle a case when ns is init_user_ns. init_user_ns doesn't have
> a parent, so we need to return an error. We can't return ENOENT in all
> cases, because we don't want to expose "that file descriptor is for the
> root user namespace" to unprivileged users.
> (Trevor suggested to add this check and it looks resonable for me
> too).

Apologies. I was skimming and misread the code. I mistook that loop for
some useful part of getting the owner. Looking in more detail...

Your code says:
+struct ns_common *ns_get_owner(struct ns_common *ns)
+{
+ const struct cred *cred = current_cred();
+ struct user_namespace *user_ns, *p;
+
+ user_ns = p = ns->user_ns;
+ if (user_ns == NULL) { /* ns is init_user_ns */
+ /* Unprivileged user should not know that it's init_user_ns. */
+ if (capable(CAP_SYS_ADMIN))
+ return ERR_PTR(-ENOENT);
+ return ERR_PTR(-EPERM);
+ }
+
+ for (;;) {
+ if (p == cred->user_ns)
+ break;
+ if (p == &init_user_ns)
+ return ERR_PTR(-EPERM);
+ p = p->parent;
+ }
+
+ return &get_user_ns(user_ns)->ns;
+}

And all else being equal it could say:

+struct ns_common *ns_get_owner(struct ns_common *ns)
s+{
+ struct user_namespace *user_ns = ns->user_ns;
+
+ /* Are we allowed to see the user namespace? */
+ if (!ns_capable(user_ns?user_ns:&init_user_ns, CAP_SYS_ADMIN))
+ return ERR_PTR(-EPERM);
+
+ if (!user_ns)
+ return ERR_PTR(-ENOENT);
+
+ return &get_user_ns(user_ns)->ns;
+}

Which I think is the root of my confusion. You hand rolled ns_capable,
and I did not recognize it because I was skimming, and just looking to
be certain the permission check was present.

Given that you have to hand roll the pid namespace check that hand
rolling may not be so bad. But it certainly was confusing the first
time I saw it especially without a comment.

Hmm.

I am not at all certain it makes sense to return -ENOENT.

Without the -ENOENT check the code is much cleaner, and clearer.

I may be blinkered but I don't see the value in letting someone know we
are talking about the initial namespace. If anything that information
is likely to cause issues with weird corner cases of checkpoint/restart,
as it acts different if you are in a container or not.

When things act different in a container that almost always is a source
of a problem somewhere.

So we could simplify the filter in the code to just this.

+struct ns_common *ns_get_owner(struct ns_common *ns)
+{
+ struct user_namespace *my_user_ns = current_user_ns();
+ struct user_namespace *owner, *p;
+
+ /* See if the owner is in the current user namespace */
+ owner = p = ns->user_ns;
+ for (;;) {
+ if (!p)
+ return ERR_PTR(-EPERM);
+ if (p == my_user_ns)
+ break;
+ p = p->parent;
+ }
+
+ return &get_user_ns(owner)->ns;
+}

And on reflection I do see the point of not using ns_capable as that
requires having privileges in a namespace while all we want here
is to see if someone is in a visible namespace.

So please ignore my comments about ns_capable on the pid namespace
parent.

But please simplify the loop and put an appropriate comment on it like I
have above. The fewer special cases the easier the code is to get
correct, and the easier it is to read.

Eric