Re: user namespace and fully visible proc and sys mounts

From: Serge E. Hallyn
Date: Sun Mar 06 2016 - 18:38:20 EST


On Sun, Mar 06, 2016 at 03:53:40PM -0600, Eric W. Biederman wrote:
> "Serge E. Hallyn" <serge.hallyn@xxxxxxxxxx> writes:
>
> > Hi,
> >
> > So we've been over this many times... but unfortunately there is more
> > breakage to report. Regular privileged and unprivileged containers
> > work all right for us. But running an unprivileged container inside a
> > privileged container is blocked.
> >
> > When creating privileged containers, lxc by default does a few things:
> > it mounts some fuse.lxcfs files over procfiles, including /proc/meminfo
> > and /proc/uptime. It mounts proc rw but /proc/sysrq-trigger ro. It also
> > moves /proc/sys/net out of the way, bind-mounts /proc/sys readonly
> > (because this container is not in a user namespace), then moves
> > /proc/sys/net back. Finally it mounts sys ro but bind-mounts
> > /sys/devices/virtual/net as writeable.
> >
> > If any of these are left enabled, unprivileged containers can't be
> > started. If all are disabled, then they can be.
> >
> > Can we find a way to make these not block remounts in child user
> > namespaces? A boot flag, a procfs and sysfs mount option, a sysctl?
>
> Are any of these overmounts done for the purpose of security? It

The fuse.lxcfs ones are not for security.

The others are for security, but only in non-user-namespaced containers.
(We're doing them in unprivileged as well for simplicity but could stop
that). We're not overmounting to hide things, we're mounting readonly
because the procfiles are owned by the same uid that is root in the
container. Now in Ubuntu we do also have precise apparmor profiles
which redundantly prevent writing, and our only real goal is to prevent
accidental host damage, but the defense in depth is still nice to have,
and I don't want to drop that.

> appears the /proc/sys and /sys mounts being made read-only is for that
> purpose.

Right, but we're not hiding anything. In fact maybe that's how we
can detect this - if the over-mounted and under-mounted dentries for a
directory are the same, ignore it, because it doesn't fall under your
original threat scenario?
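
To make the "same dentry" idea concrete, here is a rough userspace sketch
in Python. It is not the kernel's visibility check and makes no claim
about how such a check would be implemented there. It only reads
mountinfo-style text and, for each mount stacked directly on /proc,
reports whether it exposes the same path of the same filesystem that it
covers (a self bind-mount, which hides nothing) or something else. The
names parse_mountinfo, classify_submounts and SAMPLE are hypothetical,
and the sample lines are made up for illustration.

```python
import posixpath
from dataclasses import dataclass


@dataclass
class Mount:
    mount_id: int
    parent_id: int
    dev: str           # "major:minor" of the superblock
    root: str          # path inside the filesystem that this mount exposes
    mount_point: str
    options: frozenset  # per-mount options such as "ro" or "rw"
    fstype: str


def parse_mountinfo(text):
    """Parse /proc/<pid>/mountinfo-style text (octal escapes are not decoded)."""
    mounts = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Optional fields end at the " - " separator; fstype follows it.
        left, right = line.split(" - ", 1)
        f = left.split()
        mounts.append(Mount(int(f[0]), int(f[1]), f[2], f[3], f[4],
                            frozenset(f[5].split(",")), right.split()[0]))
    return mounts


def classify_submounts(mounts, top):
    """Return (mount_point, same_dentry, read_only) for direct children of top.

    same_dentry is True when the child is on the same device as the mount
    at `top` and its root equals the path it covers, i.e. a bind mount of
    a directory or file onto itself.
    """
    candidates = [m for m in mounts if m.mount_point == top]
    if not candidates:
        raise ValueError("no mount found at " + top)
    base = candidates[-1]  # the most recently listed mount at `top`
    result = []
    for m in mounts:
        if m.parent_id != base.mount_id:
            continue
        rel = posixpath.relpath(m.mount_point, top)
        expected_root = posixpath.normpath(posixpath.join(base.root, rel))
        same = m.dev == base.dev and m.root == expected_root
        result.append((m.mount_point, same, "ro" in m.options))
    return result


# Made-up sample resembling a privileged container's mount table.
SAMPLE = """\
20 1 8:1 / / rw,relatime - ext4 /dev/sda1 rw
21 20 0:4 / /proc rw,nosuid,nodev,noexec,relatime - proc proc rw
22 21 0:4 /sys /proc/sys ro,relatime - proc proc rw
23 21 0:40 /proc/meminfo /proc/meminfo rw,nosuid,nodev,relatime - fuse.lxcfs lxcfs rw
24 21 0:4 /sysrq-trigger /proc/sysrq-trigger ro,relatime - proc proc rw
"""

if __name__ == "__main__":
    for mount_point, same, ro in classify_submounts(parse_mountinfo(SAMPLE), "/proc"):
        kind = "same dentry" if same else "different content"
        print(mount_point, kind, "ro" if ro else "rw")
```

With the sample above, /proc/sys and /proc/sysrq-trigger come out as
read-only self bind-mounts, while the fuse.lxcfs file over /proc/meminfo
is flagged as different content. Only direct children of the mount at
/proc are considered; nested mounts would need a recursive walk.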

> If none of the mounts are for security the easy solution that works
> today is to also mount /proc and /sys somewhere else in your container
> so that the permission check for mounting a new copy passes.

Yeah, we used to do that, and I actually forgot that we used to do that.
I'll have to look into why it no longer suffices.

(The security aspect wasn't too bad, since we used apparmor to prevent any
writes to the redundant mounts)

> That said /proc/sys appears to be a show stopper in this scheme. As the
> root of your privileged container can enter your unprivileged container
> it can bypass your read-only /proc/sys by mounting a new copy of proc if
> we allow the relaxation you are requesting.

Yeah, will have to think about that.

> Therefore the only choice on the table (and I don't have a clue how
> realistic it is) is to have a variant of proc with just files describing
> processes. Call it processfs. That would not need the current
> restrictions.
>
> As for sysfs I am drawing a blank about what might be possible.
>
> Eric