Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Thu May 05 2016 - 03:37:02 EST

On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > This is version 2 of the VFS:userns support portable root filesystems
> > RFC. Changes since version 1:
> >
> > * Update documentation and remove some ambiguity about the feature.
> > Based on Josh Triplett's comments.
> > * Use a new email address to send the RFC :-)
> >
> >
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows taking advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> >
> >
> > 1) Presentation:
> > ================
> >
> > The main aim is to support portable root filesystems and allow
> > containers, virtual machines and other cases to use the same root
> > filesystem. Due to security reasons, filesystems can't be mounted
> > inside user namespaces, and mounting them outside will not solve the
> > problem since they will show up with the wrong UIDs/GIDs. Read and
> > write operations will also fail and so on.
> >
> > The current userspace solution is to automatically chown the whole
> > root filesystem before starting a container, example:
> > (host) init_user_ns 1000000:1065536 => (container) user_ns_X1 0:65535
> > (host) init_user_ns 2000000:2065536 => (container) user_ns_Y1 0:65535
> > (host) init_user_ns 3000000:3065536 => (container) user_ns_Z1 0:65535
> > ...
> >
> > Every time a chown is called, files are changed and so on... This
> > prevents having portable filesystems that you can drop anywhere
> > and boot. Having an extra step to adapt the filesystem to the current
> > mapping and persist it makes it impossible to verify its integrity, it
> > makes snapshots and migration a bit harder, and it probably has other
> > limitations...
> >
> > It seems that there are multiple ways to allow user namespaces to
> > combine nicely with filesystems, but none of them is that easy. The
> > approach of bind mounting and pinning the user namespace at mount time
> > will not work: bind mounts share the same super block, hence you may
> > end up working on the wrong vfsmount context and there is no easy way
> > to get out of that...
> So this option was discussed at the recent LSF/MM summit. The most
> supported suggestion was that you'd use a new internal fs type that had
> a struct mount with a new superblock and would copy the underlying
> inodes but substitute its own with modified ->getattr/->setattr calls
> that did the uid shift. In many ways it would be a remapping bind
> which would look similar to overlayfs but be a lot simpler.

Hmm, it's not only about ->getattr and ->setattr, you have all the other
file system operations that need access too... which brings two points:

1) This new internal fs may end up doing what this RFC does...

2) Or, quoting "new internal fs + its own super block + copy underlying
inodes...", it looks like another overlayfs where you also have to decide
what to copy, etc. So, will this really be that light compared to the
current overlayfs? Not to mention that you need to hook up basically the
same logic, or something else, inside overlayfs.
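
To make the ->getattr point concrete, here is a minimal userspace sketch
in Python. It is not kernel code and not what this RFC implements: the
Inode class, the SHIFT value and the three helper functions are all
hypothetical names invented for illustration, and the permission check is
reduced to a bare owner-write test. It only shows that shifting the owner
in the stat path leaves every other check comparing against the on-disk
uid.

```python
from dataclasses import dataclass

# Hypothetical offset between on-disk IDs and the IDs that the container's
# processes carry on the host. The value is illustrative only.
SHIFT = 1_000_000


@dataclass
class Inode:
    uid: int   # owner as stored on disk
    mode: int  # permission bits


def getattr_shifted(inode: Inode, shift: int = SHIFT) -> int:
    """Owner reported by a stat path that applies the shift."""
    return inode.uid + shift


def owner_may_write_raw(inode: Inode, caller_uid: int) -> bool:
    """Owner-write check that still compares against the on-disk uid."""
    return caller_uid == inode.uid and bool(inode.mode & 0o200)


def owner_may_write_shifted(inode: Inode, caller_uid: int,
                            shift: int = SHIFT) -> bool:
    """The same check, with the shift applied before comparing."""
    return caller_uid == inode.uid + shift and bool(inode.mode & 0o200)


inode = Inode(uid=0, mode=0o644)
container_root = SHIFT  # container uid 0 as seen from the host (assumed)

print(getattr_shifted(inode) == container_root)
print(owner_may_write_raw(inode, container_root))
print(owner_may_write_shifted(inode, container_root))
```

In this toy model the stat path reports the caller as the owner, the raw
check still denies the write, and only the check that also applies the
shift allows it.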

> > Using the user namespace in the super block seems the way to go, and
> > there is the "Support fuse mounts in user namespaces" [1] patches
> > which seem nice but perhaps too complex!?
> So I don't think that does what you want. The fuse project I've used
> before to do uid/gid shifts for build containers is bindfs
> It allows a --map argument where you specify pairs of uids/gids to map
> (tedious for large ranges, but the map can be fixed to use uid:range
> instead of individual).

Ok, thanks for the link. I will try to take a closer look, but bindfs seems
really big!

> > There is also the overlayfs solution, and finally the VFS layer
> > solution.
> >
> > We present here a simple VFS solution, everything is packed inside
> > VFS, filesystems don't need to know anything (except probably XFS,
> > and special operations inside union filesystems). Currently it
> > supports ext4, btrfs and overlayfs. Changes to filesystems are
> > small, just parse the vfs_shift_uids and vfs_shift_gids options
> > during mount and set the appropriate flags into the super_block
> > structure.
> So this looks a little daunting. It sprays the VFS with knowledge
> about the shifts and requires support from every underlying filesystem.

Well, from my angle, shifts are just user namespace mappings which
follow certain rules, and currently VFS and all filesystems are *already*
doing some kind of shifting... This RFC uses mount namespaces, which are
the standard way to deal with mounts. The mapping inside the mount
namespace can just be "inside: 0:1000" => "outside: 0:1000" and the
current implementation will simply use it; at the same time, I'm not
sure this mapping qualifies to be named a "shift". I think some folks
here came up with the "shift" name to describe one of the use cases
from a user interface point of view, that's it... Maybe I should do
s/vfs_shift_*/vfs_remap_*/ ?
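
As a rough illustration of why "remap" may be the better word, here is a
small userspace model in Python. It is a sketch under assumptions, not the
RFC's code. The extent layout (first inside ID, first outside ID, count)
follows the three fields of /proc/<pid>/uid_map; the Extent class and the
to_outside/to_inside helpers are hypothetical names, and reading "0:1000"
as "the first 1000 IDs" is my own assumption. The same lookup handles both
the identity case and the offset case; the offset is just one possible
mapping.

```python
from typing import List, NamedTuple, Optional


class Extent(NamedTuple):
    inside_first: int   # first ID as seen inside the namespace
    outside_first: int  # first ID as seen outside the namespace
    count: int          # number of consecutive IDs covered


def to_outside(extents: List[Extent], inside_id: int) -> Optional[int]:
    """Translate an inside ID to the outside view, or None if unmapped."""
    for e in extents:
        if e.inside_first <= inside_id < e.inside_first + e.count:
            return e.outside_first + (inside_id - e.inside_first)
    return None


def to_inside(extents: List[Extent], outside_id: int) -> Optional[int]:
    """Translate an outside ID to the inside view, or None if unmapped."""
    for e in extents:
        if e.outside_first <= outside_id < e.outside_first + e.count:
            return e.inside_first + (outside_id - e.outside_first)
    return None


# "inside: 0:1000" => "outside: 0:1000", read here as 1000 IDs from 0.
identity = [Extent(inside_first=0, outside_first=0, count=1000)]

# One container range in the style of the cover letter: container IDs
# starting at 0 backed by host IDs starting at 1000000.
offset = [Extent(inside_first=0, outside_first=1_000_000, count=65536)]

print(to_outside(identity, 42))
print(to_outside(offset, 0))
print(to_inside(offset, 1_000_500))
print(to_outside(offset, 65536))
```

With the identity map nothing moves at all, yet the code path is the same
one that produces the 1000000 offset, and IDs outside the extent simply
have no mapping.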

> A simple remapping bind filesystem would be a lot simpler and require
> no underlying filesystem support.

Yes, probably. You still need to parse parameters, but not at the
filesystem level, and sure, this RFC can do the same. But maybe
it's not safe to shift/remap filesystems and their inodes on behalf of
the filesystems... and what about virtual filesystems which can share
inodes?

> James

Thank you!

Djalal Harouni