Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Fri May 06 2016 - 10:38:58 EST


On Wed, May 04, 2016 at 11:30:09PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tixxdz@xxxxxxxxx):
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> >
> > * Update documentation and remove some ambiguity about the feature.
> > Based on Josh Triplett comments.
> > * Use a new email address to send the RFC :-)
> >
> >
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> Given your use case, is there any way we could work in some tradeoffs
> to protect the host? What I'm thinking is that containers can all
> share devices uid-mapped at will, however any device mounted with
> uid shifting cannot be used by the initial user namespace. Or maybe
> just non-executable in that case, as you'll need enough access to
> the fs to set up the containers you want to run.
> So if /dev/sda1 is your host /, you have to use /dev/sda2 as the
> container rootfs source. Mount it under /containers with uid
> shifting. Now all containers regardless of uid mappings see
> the shifted fs contents. But the host root cannot be tricked by
> files on it, as /dev/sda2 is non-executable as far as it is
> concerned.
Of course, the whole setup relies on the container manager to set up
the right mount namespace, clean mounts, etc., then pivot root, boot or

Now I guess we can achieve what you want with MS_SLAVE|MS_REC on / ?

You create new mount/pid/... namespaces with the shift flags while you are
still in init_user_ns, and you remount your / with MS_SLAVE|MS_REC. Then
you create new mount/pid namespaces again with the shift flag (two mount
namespaces here, if you don't want to race between setting the MS_SLAVE
flag and creating the mount namespace and you don't trust other
processes... or you want the same nested
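
To make the MS_SLAVE|MS_REC step concrete, here is a minimal sketch in
Python via ctypes. It only covers the mainline part: unshare the mount
namespace, then mark / as a recursive slave so that nothing mounted later
propagates back to the host. The shift flag proposed by the RFC is left out
on purpose, since its name and value are not given in this mail. The helper
names (load_libc, check, make_root_slave) are mine, not from the patches.

```python
import ctypes
import os

# Constants from <sched.h> and <linux/mount.h>.
CLONE_NEWNS = 0x00020000
MS_REC = 0x4000
MS_SLAVE = 1 << 19


def load_libc():
    """Load libc with errno support and declare mount(2)'s argument types."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    return libc


def check(ret, what):
    """Raise OSError with the saved errno if a libc call returned non-zero."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def make_root_slave(libc):
    """Unshare the mount namespace, then mark / and all submounts as slave."""
    check(libc.unshare(CLONE_NEWNS), "unshare(CLONE_NEWNS)")
    # For a propagation change, source, fstype and data are ignored.
    check(libc.mount(None, b"/", None, MS_SLAVE | MS_REC, None),
          "mount(MS_SLAVE|MS_REC)")
```

Run as root (it needs CAP_SYS_ADMIN): make_root_slave(load_libc()). Doing
the unshare first means the propagation change only affects the new mount
namespace and not the host's view of /.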

This second, secure mount namespace is the one that you use to set up the
container: device nodes, loops... the filesystems that you want inside the
container (probably with shift options), and also filesystems that you
can't mount inside user namespaces or don't want to show up in or
propagate to the host. You may also want to umount things, or remount to
change mount options, etc. Anyway, call this the cleaning of the mount
namespace.

Now, during this phase, when you mount and prepare these filesystems,
mount them with the noexec flag first and remount later with exec, or
delay the mounting until just before you do a new
clone(CLONE_NEWUSER...). During this phase the container manager should
get the device that you want shared from its input or arguments, and it
should only mount and prepare it inside the new mount namespaces or
containers, and make sure that it is never propagated back...
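
The "noexec first, exec later" idea can be sketched as two mount(2) calls.
This is only an illustration in Python: /dev/sda2 and /containers are taken
from Serge's example, while the ext4 type and the helper names
(staged_mount_steps, do_mount) are assumptions of mine. The shift mount
options from the RFC are not shown.

```python
import ctypes
import os

# Constants from <linux/mount.h>.
MS_NOEXEC = 8
MS_REMOUNT = 32


def staged_mount_steps(source, target, fstype, flags=0):
    """Return two mount(2) argument tuples: mount noexec, then remount exec."""
    first = (source, target, fstype, flags | MS_NOEXEC)
    # A remount replaces the mount flags, so restate the ones to keep.
    second = (None, target, None, MS_REMOUNT | (flags & ~MS_NOEXEC))
    return [first, second]


def do_mount(step):
    """Apply one step through libc mount(2); needs CAP_SYS_ADMIN."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    source, target, fstype, flags = step

    def enc(value):
        # None becomes a NULL pointer, strings become byte paths.
        return None if value is None else os.fsencode(value)

    if libc.mount(enc(source), enc(target), enc(fstype), flags, None) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err), target)
```

The manager would call do_mount() on the first step while it prepares the
filesystem, and on the second step only once the container is ready, e.g.
steps = staged_mount_steps("/dev/sda2", "/containers", "ext4").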

the user namespace mapping, I guess you drop capabilities, do setuid()
or whatever and start the PID 1 or the app of the container.
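
For the user namespace mapping step, a rough sketch of what the manager
writes to /proc/<pid>/uid_map could look like this. Each line of the file
is "ID-inside-ns ID-outside-ns length". The range 0 -> 100000 with 65536
IDs is a made-up example, and the helper names are mine.

```python
def format_id_map(ranges):
    """Format (inside_id, outside_id, count) triples as uid_map/gid_map text."""
    lines = []
    for inside, outside, count in ranges:
        if count <= 0:
            raise ValueError("count must be positive")
        lines.append(f"{inside} {outside} {count}\n")
    return "".join(lines)


def write_uid_map(pid, ranges):
    """Write the mapping for a process that sits in a new user namespace.

    The kernel accepts a single write to uid_map, so send it all at once.
    """
    with open(f"/proc/{pid}/uid_map", "w") as f:
        f.write(format_id_map(ranges))
```

After the mapping is written, the manager can drop capabilities, do
setuid() and start PID 1 as described above.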

Now, and so as not to confuse Dave further, since he doesn't like the idea
of a shared backing device, and neither do I, for obvious reasons: the
shared device should not be used for a rootfs. Maybe for read-only shared
user data, or shared config, that's it... but real rootfs should each have
their own *different* backing device, unless you know what you are doing!
Hehe, I don't want to confuse people, and I just lack time; I will also
respond to Dave's email.

> Just a thought.

Do you think it will solve the case?

Thanks for your comments!

Djalal Harouni