Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces
From: Serge E. Hallyn
Date: Mon Nov 06 2017 - 22:28:09 EST

On Mon, Nov 06, 2017 at 07:01:58PM -0500, Boris Lukashev wrote:
> On Mon, Nov 6, 2017 at 6:39 PM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> > Quoting Boris Lukashev (blukashev@xxxxxxxxxxxxxxxx):
> >> On Mon, Nov 6, 2017 at 5:14 PM, Serge E. Hallyn <serge@xxxxxxxxxx> wrote:
> >> > Quoting Daniel Micay (danielmicay@xxxxxxxxx):
> >> >> Substantial added attack surface will never go away as a problem. There
> >> >> aren't a finite number of vulnerabilities to be found.
> >> >
> >> > There's varying levels of usefulness and quality. There is code which I
> >> > want to be able to use in a container, and code which I can't ever see a
> >> > reason for using there. The latter, especially if it's also in a
> >> > staging driver, would be nice to have a toggle to disable.
> >> >
> >> > You're not advocating dropping the added attack surface, only adding a
> >> > way of dealing with an 0day after the fact. Privilege raising 0days can
> >> > exist anywhere, not just in code which only root in a user namespace can
> >> > exercise. So from that point of view, ksplice seems a more complete
> >> > solution. Why not just actually fix the bad code block when we know
> >> > about it?
> >> >
> >> > Finally, it has been well argued that you can gain many new caps from
> >> > having only a few others. Given that, how could you ever be sure that,
> >> > if an 0day is found which allows root in a user ns to abuse
> >> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> >> > would suffice? It seems to me that the existing control in
> >> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> >> > in that case.
> >> >
> >> > -serge
> >>
> >> This seems to be heading toward "we need full zones in Linux" with
> >> their own procfs and sysfs namespace and a stricter isolation model
> >> for resources and capabilities. So long as things can happen in a
> >> namespace which have a privileged relationship with host resources,
> >> this is going to be cat-and-mouse to one degree or another.
> >>
> >> Containers and namespaces don't have a one-to-one relationship, so I'm
> >> not sure that's the best term to use in the kernel security context
> >
> > Sorry - what's not the best term to use?
>
> Pardon, "containers," since they're namespaces+system construct.
>
> >
> >> since there's a bunch of userspace and implementation delta across the
> >> different systems (with their own security models and so forth).
> >> Without accounting for what a specific implementation may or may not
> >> do, and only looking at "how do we reduce privileged impact on parent
> >> context from unprivileged namespaces," this patch does seem to provide
> >> a logical way of reducing the privileges available in such a namespace
> >> and often needed to mount escapes/impact parent context.
> >
> > What different implementations do is irrelevant - as an unprivileged user
> > I can always, with no help, create a new user namespace mapping my current
> > uid to root, and exercise this code. So the security model implemented
> > by a particular userspace namespace-using driver doesn't matter, as it
> > only restricts me if I choose to use it.
> >
> > But, I guess you're actually saying that some program might know that it
> > should never use network code so want to drop CAP_NET_*? And you're
> > saying that a "global capability bounding set" might be useful?
> >
>
> The "global capability bounding set" with forced inheritance can be
> used to prevent the vector you describe wherein the capability of UID
> 0 in the child NS is restricted from the parent implicitly, so yes,
> that nomenclature seems appropriate.
>
> > Would it be better to actually implement it as a new bounding set that
> > is maintained across user namespace creations, but is per-task (inherited
> > by children of course)? Instead of a sysctl?
> >
> > -serge
>
> In line with the previous comment, the inheritance across subsequent
> invocations should be forced to prevent the context you described.
> Please pardon my ignorance, not sure what you mean in terms of
> "per-task" across namespace creation.

I meant that each task would have a perm_cap_bset next to its cap_bset.
So task p1 (if it has privilege) can drop CAP_SYS_ADMIN from its
perm_cap_bset, and p2 (if it has privilege) can drop CAP_NET_ADMIN from
its own. When p1 creates a new user_ns, the init task of that user_ns
has its cap_bset set to all caps but CAP_SYS_ADMIN.

I think for simplicity perm_cap_bset would *only* affect the filling
of cap_bset at user namespace creation. So if you wanted to drop a
capability from your own cap_bset as well, you'd have to do that
separately.
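
To make the intended semantics concrete, here is a rough userspace model
of the idea, not kernel code. perm_cap_bset does not exist in the kernel;
the struct, the helper names (perm_bset_drop, create_user_ns), the 64-bit
mask, and the "privileged" flag are all made up for illustration. It also
assumes that perm_cap_bset is carried into the new namespace unchanged.
Only the two capability numbers are taken from <linux/capability.h>.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Capability numbers as defined in <linux/capability.h>. */
#define CAP_NET_ADMIN 12
#define CAP_SYS_ADMIN 21

/* Toy representation: one bit per capability in a 64-bit mask. */
#define ALL_CAPS     (~UINT64_C(0))
#define CAP_BIT(cap) (UINT64_C(1) << (cap))

/* Hypothetical per-task state; perm_cap_bset is the proposed field. */
struct task {
	uint64_t cap_bset;      /* ordinary capability bounding set */
	uint64_t perm_cap_bset; /* proposed: seeds cap_bset at user_ns creation */
	bool privileged;        /* stand-in for "if it has privilege" */
};

/*
 * Drop a capability from perm_cap_bset only. The task's own cap_bset is
 * deliberately left alone; dropping from it is a separate step.
 * Returns 0 on success, -1 if the task lacks privilege.
 */
static int perm_bset_drop(struct task *t, int cap)
{
	assert(cap >= 0 && cap < 64);
	if (!t->privileged)
		return -1;
	t->perm_cap_bset &= ~CAP_BIT(cap);
	return 0;
}

/*
 * Model user namespace creation: the init task of the new user_ns gets
 * its cap_bset filled from the creator's perm_cap_bset.
 */
static struct task create_user_ns(const struct task *creator)
{
	struct task init = {
		.cap_bset = creator->perm_cap_bset,
		/* Assumption: the restriction persists across creations. */
		.perm_cap_bset = creator->perm_cap_bset,
		/* Assumption: root in the new user_ns may drop further bits. */
		.privileged = true,
	};
	return init;
}
```

With this model, a p1 that starts with ALL_CAPS in both sets and calls
perm_bset_drop(&p1, CAP_SYS_ADMIN) keeps CAP_SYS_ADMIN in its own
cap_bset, while the task returned by create_user_ns(&p1) has every bit
set in cap_bset except CAP_SYS_ADMIN.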