Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Serge Hallyn
Date: Mon May 09 2016 - 12:27:14 EST


Quoting Djalal Harouni (tixxdz@xxxxxxxxx):
> Hi,
>
> On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> > Quoting Djalal Harouni (tixxdz@xxxxxxxxx):
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > >
> > > * Update documentation and remove some ambiguity about the feature.
> > > Based on Josh Triplett's comments.
> > > * Use a new email address to send the RFC :-)
> > >
> > >
> > > This RFC explores how to support filesystem operations inside a
> > > user namespace using only the VFS and a per-mount-namespace solution.
> > > This makes it possible to take advantage of user namespace separation
> > > without introducing any change at the filesystem level. All this is
> > > handled with the virtual view of mount namespaces.
> >
> > Given your use case, is there any way we could work in some tradeoffs
> > to protect the host? What I'm thinking is that containers can all
> > share devices uid-mapped at will, however any device mounted with
> > uid shifting cannot be used by the initial user namespace. Or maybe
> > just non-executable in that case, as you'll need enough access to
> > the fs to set up the containers you want to run.
> >
> > So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> > container rootfs source. Mount it under /containers with uid
> > shifting. Now all containers regardless of uid mappings see
> > the shifted fs contents. But the host root cannot be tricked by
> > files on it, as /dev/sda2 is non-executable as far as it is
> > concerned.
> Of course the whole setup relies on the container manager to set up
> the right mount namespace, clean mounts, etc then pivot root, boot or
> whatever...
>
> Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?
>
> You create a new mount/pid... namespaces with shift flags, but you are still
> in init_user_ns, you remount your / with MS_SLAVE|MS_REC, then you
> create new mount/pid namespaces with shift flag (two mount namespaces
> here if you don't want to race setting MS_SLAVE flag and creating mount
> namespace and you don't trust other processes... or you want the same nested
> setup...)
>
> This second new secure mount namespace will be the one that you will use
> to set up the container: device nodes, loops, the filesystems that you want
> in the container (probably with shift options), and also filesystems that
> you can't mount inside user namespaces or don't want to show up or propagate
> into the host. You may also want to umount things or remount to change mount
> options, etc. Call this phase the cleaning of the mount namespace.
>
> Now during this phase, when you mount and prepare these file systems,
> mount them with noexec flag first, then remount later with exec, or delay
> the mounting just before you do a new clone(CLONE_NEWUSER...). During this
> phase the container manager should get the device that you want to be
> shared from input or argument, and it will only mount it and prepare
> it inside new mount namespaces or containers and make sure that it will
> never be propagated back...
>
> After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), set up
> the user namespace mapping, I guess you drop capabilities, do setuid()
> or whatever and start the PID 1 or the app of the container.
>
> Now, so as not to confuse Dave further (he doesn't like the idea of
> a shared backing device, and neither do I, for obvious reasons!): the shared
> device should not be used for a rootfs. Maybe for read-only shared user
> data or shared config, that's it... but real rootfses should each have
> their own *different* backing device! Unless you know what you are doing,
> hehe. I don't want to confuse people, and I just lack time; I will also
> respond to Dave's email.

Yes. We're saying slightly different things. You're saying that the admin
should assign different backing stores for containers. I'm saying perhaps
the kernel should enforce that, because $leaks. Let's say the host admin
did a perfect setup of a container with shifted uids. Now he wants to
run a quick ps in the container... he does it in a way that leaks a
/proc/pid reference into the container so that (evil) container root can
use /proc/pid/root/ to get a toehold into the host /. Does he now have
shifted access to that?

I think if we say "this blockdev will have shifted uids in /proc/$pid/ns/user",
then immediately that blockdev becomes not-readable (or not-executable)
in any namespace which does not have /proc/$pid/ns/user as an ancestor.
With the obvious check, analogous to write-versus-execute exclusion, that
you cannot mark the blockdev shifted if an ancestor user_ns already has a
file on it open for execute.
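
As a rough illustration of that rule, here is a toy user-space model in
Python. It is not kernel code and not part of the RFC. The names UserNS,
BlockDev, mark_shifted and may_access are made up for this sketch, and
treating the target namespace itself as allowed (not only its descendants)
is an assumption.

```python
class UserNS:
    """Toy stand-in for a user namespace; only the parent link matters here."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

    def ancestors(self):
        """Yield every strict ancestor, nearest first."""
        ns = self.parent
        while ns is not None:
            yield ns
            ns = ns.parent


class BlockDev:
    """Toy block device that tracks who has a file open for execute."""

    def __init__(self, name):
        self.name = name
        self.shifted_for = None    # UserNS the device is shifted for, if any
        self.exec_openers = set()  # namespaces holding a file open for execute


def mark_shifted(dev, target_ns):
    """Mark dev shifted for target_ns; refuse if an ancestor executes from it."""
    for ns in target_ns.ancestors():
        if ns in dev.exec_openers:
            raise PermissionError(
                f"{dev.name}: {ns.name} has a file open for execute")
    dev.shifted_for = target_ns


def may_access(dev, ns):
    """Unshifted devices are unrestricted; shifted ones need the target
    namespace somewhere in the accessor's ancestry."""
    if dev.shifted_for is None:
        return True
    # Assumption: the target namespace itself counts as allowed.
    return ns is dev.shifted_for or dev.shifted_for in ns.ancestors()
```

With init_user_ns as the parent of a container namespace, marking a device
shifted for the container makes may_access() false for init_user_ns and for
sibling namespaces, while the container and anything nested under it keep
access.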

BTW, perhaps I should do this in a separate email, but here is how I would
expect to use this:

1. Using zfs: I create a bare (unshifted) rootfs fs1. When I want to
create a new container, I zfs clone fs1 to fs2, and let the container
use fs2 shifted. No danger to fs1 since fs2 is cow. Same with btrfs.

2. Using overlay: I create a bare (unshifted) rootfs fs1. When I want
to create a new container, I mount fs1 read-only and shifted as base
layer, then fs2 as the rw layer.
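
The two workflows above could be scripted roughly as follows. This is only
a sketch that builds the command lines without running them. The dataset
names and paths (tank/fs1, /containers/...) are hypothetical, and the
uid-shifting step is deliberately left out because its interface is exactly
what the RFC is still discussing. The zfs and overlay invocations themselves
use the standard syntax: zfs clones are created from a snapshot, and overlay
takes lowerdir, upperdir and workdir options.

```python
import shlex


def zfs_clone_cmds(base, clone, snap="base"):
    """Commands to snapshot `base` and create `clone` from that snapshot."""
    snapshot = f"{base}@{snap}"
    return [
        ["zfs", "snapshot", snapshot],
        ["zfs", "clone", snapshot, clone],
    ]


def overlay_mount_cmd(lower, upper, work, target):
    """Command to mount `lower` as the read-only base with `upper` as rw layer.

    overlay requires `upper` and `work` to live on the same filesystem.
    """
    opts = f"lowerdir={lower},upperdir={upper},workdir={work}"
    return ["mount", "-t", "overlay", "overlay", "-o", opts, target]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    cmds = zfs_clone_cmds("tank/fs1", "tank/fs2")
    cmds.append(overlay_mount_cmd("/containers/fs1", "/containers/fs2/upper",
                                  "/containers/fs2/work", "/containers/c1"))
    for cmd in cmds:
        print(shlex.join(cmd))
```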

The point here is that the zfs clone plus container start takes (for a
600-800M rootfs) about .5 seconds on my laptop, while the act of shifting
all the uids takes another 2 seconds. So being able to do this without
manually shifting would be a huge improvement for cases (e.g. docker)
where you do lots and lots of quick deploys.
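
For reference, "manually shifting" means something like the following
sketch: walk the whole rootfs and chown every entry by a fixed offset, so
the cost grows with the number of files rather than staying constant like a
clone. The function names and the offset value are assumptions for
illustration. This is not the tool that produced the timings above.

```python
import os


def plan_shift(root, offset):
    """Return (path, new_uid, new_gid) for root and everything below it."""
    paths = [root]
    # os.walk does not follow symlinks by default, so we stay inside root.
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    plan = []
    for path in paths:
        st = os.lstat(path)  # lstat: describe a symlink itself, not its target
        plan.append((path, st.st_uid + offset, st.st_gid + offset))
    return plan


def apply_shift(plan):
    """Apply the plan. Needs privilege; one lchown() call per entry."""
    for path, uid, gid in plan:
        os.lchown(path, uid, gid)


if __name__ == "__main__":
    # Hypothetical path and offset; prints the plan without changing anything.
    for entry in plan_shift("/containers/fs2", 100000):
        print(entry)
```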

> > Just a thought.
>
> You think it will solve the case ?
>
>
> Thanks for your comments!
>
> --
> Djalal Harouni
> http://opendz.org