Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: James Bottomley
Date: Thu May 05 2016 - 07:56:49 EST


On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > This is version 2 of the VFS:userns support portable root
> > > filesystems
> > > RFC. Changes since version 1:
> > >
> > > * Update documentation and remove some ambiguity about the
> > > feature.
> > > Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > >
> > >
> > > This RFC tries to explore how to support filesystem operations
> > > inside user namespace using only VFS and a per mount namespace
> > > solution. This allows taking advantage of user namespace
> > > separations without introducing any change at the filesystems
> > > level. All this is handled with the virtual view of mount
> > > namespaces.
> > >
> > >
> > > 1) Presentation:
> > > ================
> > >
> > > The main aim is to support portable root filesystems and allow
> > > containers, virtual machines and other cases to use the same root
> > > filesystem. Due to security reasons, filesystems can't be mounted
> > > inside user namespaces, and mounting them outside will not solve
> > > the problem since they will show up with the wrong UIDs/GIDs.
> > > Read and write operations will also fail and so on.
> > >
> > > The current userspace solution is to automatically chown the
> > > whole root filesystem before starting a container, example:
> > > (host) init_user_ns 1000000:1065536 => (container) user_ns_X1
> > > 0:65535
> > > (host) init_user_ns 2000000:2065536 => (container) user_ns_Y1
> > > 0:65535
> > > (host) init_user_ns 3000000:3065536 => (container) user_ns_Z1
> > > 0:65535
> > > ...
> > >
> > > Every time a chown is called, files are changed and so on... This
> > > prevents having portable filesystems that you can throw anywhere
> > > and boot. Having an extra step to adapt the filesystem to the
> > > current mapping and persist it makes it impossible to verify the
> > > filesystem's integrity, makes snapshots and migration a bit
> > > harder, and probably has other limitations...
> > >
> > > It seems that there are multiple ways to allow user namespaces to
> > > combine nicely with filesystems, but none of them is that easy.
> > > Bind mounting and pinning the user namespace at mount time will
> > > not work: bind mounts share the same super block, hence you may
> > > end up working on the wrong vfsmount context and there is no easy
> > > way to get out of that...
> >
> > So this option was discussed at the recent LSF/MM summit. The most
> > supported suggestion was that you'd use a new internal fs type that
> > had a struct mount with a new superblock and would copy the
> > underlying inodes but substitute its own with modified ->getattr/
> > ->setattr calls that did the uid shift. In many ways it would be a
> > remapping bind which would look similar to overlayfs but be a lot
> > simpler.
>
> Hmm, it's not only about ->getattr and ->setattr, you have all the
> other file system operations that need access too...

Why? Or perhaps we should more cogently define the actual problem. My
problem is simply mounting image volumes that were created with real
uids so that they appear at the user namespace shifted uids, because
I'm downshifting the privileged ids in the container. I actually *only*
need the uid/gids on the attributes shifted because that's what I need
to manipulate the volumes. I actually think that other operations, like
the file ioctl ones, should, for security reasons, not be uid shifted.
For instance, with xfs you could set the panic mask and error tags and
bring down the whole host. What extra things do you need access to and
why?
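
To make the attribute-only shift concrete, here is a rough userspace
sketch of the arithmetic, not code from any patch. The struct and
function names are hypothetical. It assumes one contiguous range, with
on-disk ids presented on the getattr side and reversed on the setattr
side, and it rejects anything outside the range instead of guessing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one contiguous shift: on-disk ids in
 * [disk_base, disk_base + count) are presented as ids in
 * [host_base, host_base + count). */
struct id_shift {
    uint32_t disk_base;
    uint32_t host_base;
    uint32_t count;
};

/* getattr direction: on-disk id -> shifted id. Returns false if the
 * id is outside the shifted range. */
static bool shift_to_host(const struct id_shift *s, uint32_t disk_id,
                          uint32_t *host_id)
{
    assert(s != NULL && host_id != NULL);
    if (disk_id < s->disk_base || disk_id - s->disk_base >= s->count)
        return false;
    *host_id = s->host_base + (disk_id - s->disk_base);
    return true;
}

/* setattr direction: shifted id -> on-disk id. Returns false if the
 * id is outside the shifted range. */
static bool shift_to_disk(const struct id_shift *s, uint32_t host_id,
                          uint32_t *disk_id)
{
    assert(s != NULL && disk_id != NULL);
    if (host_id < s->host_base || host_id - s->host_base >= s->count)
        return false;
    *disk_id = s->disk_base + (host_id - s->host_base);
    return true;
}
```

With a shift of { 0, 1000000, 65536 }, matching the first mapping in
the quoted example, a file owned by on-disk uid 0 would be reported as
1000000, and a chown to 1000042 would store 42 on the volume.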

> which brings two points:
>
> 1) This new internal fs may end up doing what this RFC does...

Well that was why I brought it up, yes.

> 2) or by quoting "new internal fs + its own super block + copy
> underlying inodes..." it seems like another overlayfs where you also
> need some decisions to copy what, etc. So, will this be really
> that light compared to current overlayfs ? not to mention that you
> need to hook up basically the same logic or something else inside
> overlayfs..

OK, so forget overlayfs, perhaps that was a bad example. It's like a
uid shifting bind. It works by using shadow inodes (unlike bind,
because you have to intercept the operations, so it's not a simple
subtree operation), but there's no file copying. Each shadow inode
points to the real inode.
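
As an illustration only, the shadow inode idea can be modelled in
userspace like this. All the types and names below are made up for the
sketch and are not kernel structures. It assumes on-disk ids start at 0
and that offset + count fits in 32 bits. The shadow holds a pointer to
the real inode and intercepts only the attribute calls.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the underlying filesystem's inode (hypothetical). */
struct real_inode {
    uint32_t uid;
    uint32_t gid;
    uint64_t size;
};

/* The shadow references the real inode; nothing is copied. */
struct shadow_inode {
    struct real_inode *real;
    uint32_t offset;  /* added to on-disk ids when reading attributes */
    uint32_t count;   /* number of ids covered, starting at on-disk 0 */
};

struct attrs {
    uint32_t uid;
    uint32_t gid;
    uint64_t size;
};

/* Intercepted getattr: read through to the real inode, shift ids up. */
static bool shadow_getattr(const struct shadow_inode *sh, struct attrs *out)
{
    assert(sh != NULL && sh->real != NULL && out != NULL);
    if (sh->real->uid >= sh->count || sh->real->gid >= sh->count)
        return false;
    out->uid = sh->real->uid + sh->offset;
    out->gid = sh->real->gid + sh->offset;
    out->size = sh->real->size;  /* non-id attributes pass through */
    return true;
}

/* Intercepted setattr (chown case): shift ids down, write to the real
 * inode. Ids outside the shifted range are refused. */
static bool shadow_chown(struct shadow_inode *sh, uint32_t uid, uint32_t gid)
{
    assert(sh != NULL && sh->real != NULL);
    if (uid < sh->offset || uid - sh->offset >= sh->count)
        return false;
    if (gid < sh->offset || gid - sh->offset >= sh->count)
        return false;
    sh->real->uid = uid - sh->offset;
    sh->real->gid = gid - sh->offset;
    return true;
}
```

Every other operation would go straight to the real inode unshifted,
which is the point of keeping the ioctls out of it.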

> > > Using the user namespace in the super block seems the way to go,
> > > and there is the "Support fuse mounts in user namespaces" [1]
> > > patches which seem nice but perhaps too complex!?
> >
> > So I don't think that does what you want. The fuse project I've
> > used before to do uid/gid shifts for build containers is bindfs
> >
> > https://github.com/mpartel/bindfs/
> >
> > It allows a --map argument where you specify pairs of uids/gids to
> > map (tedious for large ranges, but the map can be fixed to use
> > uid:range instead of individual).
>
> Ok, thanks for the link, will try to take a deep look but bindfs seem
> really big!

Well, it does a lot more than just uid/gid shift.

> > > There is also the overlayfs solution, and finally the VFS layer
> > > solution.
> > >
> > > We present here a simple VFS solution, everything is packed
> > > inside VFS, filesystems don't need to know anything (except
> > > probably XFS, and special operations inside union filesystems).
> > > Currently it supports ext4, btrfs and overlayfs. Changes into
> > > filesystems are small, just parse the vfs_shift_uids and
> > > vfs_shift_gids options during mount and set the appropriate flags
> > > into the super_block structure.
> >
> > So this looks a little daunting. It sprays the VFS with knowledge
> > about the shifts and requires support from every underlying
> > filesystem.

> Well, from my angle, shifts are just user namespace mappings which
> follow certain rules, and currently VFS and all filesystems are
> *already* doing some kind of shifting... This RFC uses mount
> namespaces which are the standard way to deal with mounts, now the
> mapping inside mount namespace can just be "inside: 0:1000" =>
> "outside: 0:1000" and current implementation will just use it, at the
> same time I'm not sure if this mapping qualifies to be named "shift".
> I think that some folks here came up with the "shift" name to
> describe one of the use cases from a user interface that's it...
> maybe I should do s/vfs_shift_*/vfs_remap_*/ ?

I don't think the naming is the issue ... it's the spread inside the
vfs code (and in the underlying fs code). The vfs is very well
layered, so touching all that code makes it look like there's a
layering problem with the patch. Touching the underlying fs code looks
even more problematic, but that may be necessary if you have a reason
for wanting the file ioctls, because they're pass through and are
usually where the from_kuid() calls sit in filesystems.

> > A simple remapping bind filesystem would be a lot simpler and
> > require no underlying filesystem support.
>
> Yes probably, you still need to parse parameters but not at the
> filesystem level,

They'd just be mount options. Basically instead of mount --bind source
target, you'd do mount -t uidshift -o <shift options> source target.

> and sure this RFC can do the same of course, but maybe it's not safe
> to shift/remap filesystems and their inodes on behalf of
> filesystems... and virtual filesystems which can share inodes ?

That depends on who you allow to do the shift. Each fstype in the kernel
decides access to mount. For the uidshift, I was planning to allow
only a capable admin in the initial namespace, meaning that only the
admin in the host could set up the shifts. As long as the shifted
filesystem is present, the container can then bind it wherever it wants
in its mount namespace.

James