Re: [kernel-hardening] Re: [PATCH resend 2/2] userns: control capabilities of some user namespaces
From: Serge E. Hallyn
Date: Mon Nov 06 2017 - 22:23:18 EST
On Mon, Nov 06, 2017 at 09:16:03PM -0500, Daniel Micay wrote:
> On Mon, 2017-11-06 at 16:14 -0600, Serge E. Hallyn wrote:
> > Quoting Daniel Micay (danielmicay@xxxxxxxxx):
> > > Substantial added attack surface will never go away as a problem. There
> > > aren't a finite number of vulnerabilities to be found.
> >
> > There's varying levels of usefulness and quality. There is code which I
> > want to be able to use in a container, and code which I can't ever see a
> > reason for using there. The latter, especially if it's also in a
> > staging driver, would be nice to have a toggle to disable.
> >
> > You're not advocating dropping the added attack surface, only adding a
> > way of dealing with an 0day after the fact. Privilege raising 0days can
> > exist anywhere, not just in code which only root in a user namespace can
> > exercise. So from that point of view, ksplice seems a more complete
> > solution. Why not just actually fix the bad code block when we know
> > about it?
>
> That's not what I'm advocating. I only care about it for proactive
> attack surface reduction downstream. I have no interest in using it to
> block access to known vulnerabilities.
>
> > Finally, it has been well argued that you can gain many new caps from
> > having only a few others. Given that, how could you ever be sure that,
> > if an 0day is found which allows root in a user ns to abuse
> > CAP_NET_ADMIN against the host, just keeping CAP_NET_ADMIN from them
> > would suffice?
>
> I didn't suggest using it that way...
>
> > It seems to me that the existing control in
> > /proc/sys/kernel/unprivileged_userns_clone might be the better duct tape
> > in that case.
>
> There's no such thing as unprivileged_userns_clone in mainline.

Hm. I was sure Kees had gotten that in... I guess I was wrong.

> The advantage of this over unprivileged_userns_clone in Debian and maybe
> some other distributions is not giving up unprivileged app containers /
> sandboxes implemented via user namespaces. For example, Chromium's user
> namespace sandbox likely only needs to have CAP_SYS_CHROOT. Chromium
> will be dropping their setuid sandbox, forcing usage of user namespaces
> to avoid losing the sandbox which will greatly increase local kernel
> attack surface on the host by exposing netfilter management, etc. to
> unprivileged users.
>
> The proposed approach isn't necessarily the best way to implement this
> kind of mitigation but I think it's filling a real need.

I think I definitely prefer what I mentioned in the email to Boris.
Basically a "permanent capability bounding set". The normal bounding
set gets reset to a full set on every new user_ns creation. In this
proposal, it would instead be set to the calling task's permanent
capability set, which starts (at boot) full, and which privileged
tasks can pull capabilities out of.
-serge
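
A minimal sketch of the "permanent capability bounding set" idea described
above, written as a toy Python model rather than kernel code. Everything in
it is hypothetical: the Task class, the permanent_bset field, the
drop_permanent() and fork() helpers, and the four-capability subset are
illustration only, and inheritance of the permanent set across fork is an
assumption the email does not spell out. It only contrasts the two rules for
the bounding set of a newly created user namespace.

```python
from dataclasses import dataclass

# Illustrative subset of capability names; the real kernel defines many more.
ALL_CAPS = frozenset(
    {"CAP_SYS_CHROOT", "CAP_NET_ADMIN", "CAP_SYS_ADMIN", "CAP_SETUID"}
)


@dataclass
class Task:
    """Hypothetical model of a task; not a kernel data structure."""

    privileged: bool = False
    # Starts full (as at boot) and can only shrink.
    permanent_bset: frozenset = ALL_CAPS

    def drop_permanent(self, cap: str) -> None:
        """Remove a capability from the permanent set (privileged tasks only)."""
        if not self.privileged:
            raise PermissionError("only a privileged task may shrink the set")
        self.permanent_bset = self.permanent_bset - {cap}

    def fork(self, privileged: bool = False) -> "Task":
        """Assumption: a child inherits the parent's permanent set."""
        return Task(privileged=privileged, permanent_bset=self.permanent_bset)


def userns_bset_current(task: Task) -> frozenset:
    """Existing behaviour: a new user_ns gets a full bounding set."""
    return ALL_CAPS


def userns_bset_proposed(task: Task) -> frozenset:
    """Proposal: a new user_ns is bounded by the caller's permanent set."""
    return task.permanent_bset


if __name__ == "__main__":
    admin = Task(privileged=True)
    admin.drop_permanent("CAP_NET_ADMIN")
    sandbox = admin.fork()  # unprivileged descendant creating a user_ns
    print("current: ", sorted(userns_bset_current(sandbox)))
    print("proposed:", sorted(userns_bset_proposed(sandbox)))
```

In this model an unprivileged sandbox could still obtain CAP_SYS_CHROOT
inside its new namespace, while CAP_NET_ADMIN stays unavailable once a
privileged task has pulled it out of the permanent set.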