Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: James Bottomley
Date: Thu May 05 2016 - 18:08:22 EST


On Thu, 2016-05-05 at 22:49 +0100, Djalal Harouni wrote:
> On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> > On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > > This is version 2 of the VFS:userns support portable root
> > > > > filesystems
> > > > > RFC. Changes since version 1:
> > > > >
> > > > > * Update documentation and remove some ambiguity about the
> > > > > feature. Based on Josh Triplett's comments.
> > > > > * Use a new email address to send the RFC :-)
> > > > >
> > > > >
> > > > > This RFC tries to explore how to support filesystem
> > > > > operations inside user namespaces using only the VFS and a
> > > > > per mount namespace solution. This allows taking advantage
> > > > > of user namespace separation without introducing any change
> > > > > at the filesystem level. All of this is handled through the
> > > > > virtual view of mount namespaces.
> > > > >
> > > > >
> > > > > 1) Presentation:
> > > > > ================
> > > > >
> > > > > The main aim is to support portable root filesystems and
> > > > > allow containers, virtual machines and other cases to use
> > > > > the same root filesystem. For security reasons, filesystems
> > > > > can't be mounted inside user namespaces, and mounting them
> > > > > outside will not solve the problem since they will show up
> > > > > with the wrong UIDs/GIDs; read and write operations will
> > > > > also fail, and so on.
> > > > >
> > > > > The current userspace solution is to automatically chown the
> > > > > whole root filesystem before starting a container, for example:
> > > > > (host) init_user_ns 1000000:1065536 => (container) user_ns_X1 0:65535
> > > > > (host) init_user_ns 2000000:2065536 => (container) user_ns_Y1 0:65535
> > > > > (host) init_user_ns 3000000:3065536 => (container) user_ns_Z1 0:65535
> > > > > ...
> > > > >
> > > > > Every time chown is called, files are changed, and so on...
> > > > > This prevents having portable filesystems that you can drop
> > > > > anywhere and boot. Adding an extra step to adapt the
> > > > > filesystem to the current mapping and persist it means you
> > > > > can no longer verify its integrity, it makes snapshots and
> > > > > migration a bit harder, and it probably has other
> > > > > limitations...
> > > > >
> > > > > It seems that there are multiple ways to allow user
> > > > > namespaces to combine nicely with filesystems, but none of
> > > > > them is that easy. Bind mounting and pinning the user
> > > > > namespace at mount time will not work: bind mounts share the
> > > > > same super block, hence you may end up working in the wrong
> > > > > vfsmount context, and there is no easy way to get out of
> > > > > that...
> > > >
> > > > So this option was discussed at the recent LSF/MM summit. The
> > > > most supported suggestion was that you'd use a new internal fs
> > > > type that had a struct mount with a new superblock and would
> > > > copy the underlying inodes but substitute its own with
> > > > modified ->getattr/->setattr calls that did the uid shift.
> > > > In many ways it would be a remapping bind which would look
> > > > similar to overlayfs but be a lot simpler.
> > >
> > > Hmm, it's not only about ->getattr and ->setattr, you have all
> > > the other file system operations that need access too...
> >
> > Why? Or perhaps we should more cogently define the actual problem.
> > My problem is simply mounting image volumes that were created with
> > real uids at user-namespace-shifted uids, because I'm downshifting
> > the privileged ids in the container. I actually *only* need the
> > uid/gids on the attributes shifted, because that's what I need to
> > manipulate the
> >
> We obviously need them for read/write/creation... ?!

OK, so the way attributes are populated on an inode is via getattr.
You intercept that and change the owner and group that are installed
on the inode. That means that when you list the directory, you see the
shift, and the shifted uid/gid are used to check permissions for
vfs_open().
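
To make that concrete, here's a minimal sketch of the interception
(shiftfs_getattr, shift_kuid() and shift_kgid() are hypothetical
names, and the fixed offset is purely illustrative; a real
implementation would take the mapping from the mount options):

#include <linux/fs.h>
#include <linux/uidgid.h>

#define SHIFT_OFFSET 1000000	/* illustrative fixed offset */

/* Translate an on-disk id into the container's shifted range. */
static kuid_t shift_kuid(kuid_t uid)
{
	return KUIDT_INIT(__kuid_val(uid) + SHIFT_OFFSET);
}

static kgid_t shift_kgid(kgid_t gid)
{
	return KGIDT_INIT(__kgid_val(gid) + SHIFT_OFFSET);
}

static int shiftfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
			   struct kstat *stat)
{
	/* Report the underlying inode's attributes... */
	generic_fillattr(dentry->d_inode, stat);

	/* ...but with the owner shifted, so directory listings and
	 * the permission check done for vfs_open() both see the
	 * shifted uid/gid. */
	stat->uid = shift_kuid(stat->uid);
	stat->gid = shift_kgid(stat->gid);
	return 0;
}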

> We also want to handle stock filesystems that were never edited,
> without depending on any module or third-party solution: mount them
> outside user namespaces, and access them inside.

OK, but those are basically my requirements ... you didn't mention any
of the esoteric filesystem ioctls, so I assume from the below that
you're not interested in shifting the uids there either?

> > volumes. I actually think that other operations, like the file
> > ioctl ones, should, for security reasons, not be uid shifted. For
> > instance, with xfs you could set the panic mask and error tags and
> > bring down the whole host. What extra things do you need access to,
> > and why?
>
> That's precisely why I said that handling these options not *inside*
> filesystems, which means behind their back and on behalf of container
> managers, etc., exposes you to such scenarios... Some virtual
> filesystems can also be mounted by unprivileged users; how will you
> deal with something like a bind mount on them?
>
>
> > > which brings two points:
> > >
> > > 1) This new internal fs may end up doing what this RFC does...
> >
> > Well that was why I brought it up, yes.
>
> Yes, but *with* extra code! That was my point. I'm not sure we need
> to bother with any *new* internal fs type, nor hack around directory
> and file operations... yet that has to be shown, defined, coded... ?

Either way requires patching the kernel. The question I was asking is
whether it's better to confine the patch to a new fs type or to change
the vfs directly.

> > > 2) Or, quoting "new internal fs + its own super block + copy
> > > underlying inodes...", it seems like another overlayfs where you
> > > also need to decide what to copy, etc. So, will this really be
> > > that light compared to the current overlayfs? Not to mention that
> > > you need to hook up basically the same logic, or something else,
> > > inside overlayfs...
> >
> > OK, so forget overlayfs, perhaps that was a bad example. It's like
> > a uid shifting bind. The way it works is to use shadow inodes
> > (unlike bind, because you have to intercept the operations, so it's
> > not a simple subtree operation), but there's no file copying. The
> > shadow points to the real inode.
>
> For that you need a super block struct for every mount... now if you
> also need a new internal fs + super block + shadow inodes... it
> seems like you are going in the overlayfs direction...

Well, that's the way you build a shadowing fs. I'm not sure you need
one sb per struct vfsmount, but it's certainly possible to code it
that way.
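
For illustration, a shadow inode could be little more than a wrapper
that pins the inode it shadows (struct shadow_inode, SHADOW_I() and
shadow_init() are hypothetical names; shift_kuid()/shift_kgid() are
the illustrative helpers from the sketch above):

#include <linux/fs.h>
#include <linux/kernel.h>

struct shadow_inode {
	struct inode vfs_inode;	/* the inode this fs exposes */
	struct inode *real;	/* pinned underlying inode */
};

static inline struct shadow_inode *SHADOW_I(struct inode *inode)
{
	return container_of(inode, struct shadow_inode, vfs_inode);
}

/* Instantiate a shadow: take a reference on the real inode and
 * mirror its attributes, rewriting only the owner.  No file data is
 * copied; operations are simply forwarded to ->real. */
static void shadow_init(struct shadow_inode *si, struct inode *real)
{
	si->real = igrab(real);
	si->vfs_inode.i_mode = real->i_mode;
	si->vfs_inode.i_uid  = shift_kuid(real->i_uid);
	si->vfs_inode.i_gid  = shift_kgid(real->i_gid);
}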

> I'm taking overlayfs as an example here because it's just nice and
> really dead simple!
>
> At the same time, I'm not at all sure about what you are describing,
> and how you will deal with the current mount and bind mount trees
> and all the internals...

You mean, would MS_REC functionality be supported? There's no reason
why not, but there's no reason you have to either (it could even be
optional, like it is for bind).
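
For reference, that's the same distinction mount(2) already makes for
binds, so a hypothetical shifting mount type could honour the flag the
same way (the paths below are placeholders):

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
	/* Plain bind: only the mount at /src itself is replicated. */
	if (mount("/src", "/dst", NULL, MS_BIND, NULL))
		perror("bind");

	/* Recursive bind: submounts below /src come along (MS_REC). */
	if (mount("/src", "/dst", NULL, MS_BIND | MS_REC, NULL))
		perror("rbind");
	return 0;
}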

> > > > > Using the user namespace in the super block seems the way to
> > > > > go, and there are the "Support fuse mounts in user
> > > > > namespaces" [1] patches, which seem nice but perhaps too
> > > > > complex!?
> > > >
> > > > So I don't think that does what you want. The fuse project
> > > > I've used before to do uid/gid shifts for build containers is
> > > > bindfs https://github.com/mpartel/bindfs/
> > > >
> > > > It allows a --map argument where you specify pairs of uids/gids
> > > > to map (tedious for large ranges, but the map can be fixed to
> > > > use uid:range instead of individual).
> > >
> > > Ok, thanks for the link, I will try to take a deep look, but
> > > bindfs seems really big!
> >
> > Well, it does a lot more than just uid/gid shifting.
> >
> > > > > there is also the overlayfs solution, and finally the VFS
> > > > > layer solution.
> > > > >
> > > > > We present here a simple VFS solution: everything is packed
> > > > > inside the VFS, and filesystems don't need to know anything
> > > > > (except probably XFS, and special operations inside union
> > > > > filesystems). Currently it supports ext4, btrfs and
> > > > > overlayfs. Changes to filesystems are small: just parse the
> > > > > vfs_shift_uids and vfs_shift_gids options during mount and
> > > > > set the appropriate flags in the super_block structure.
> > > >
> > > > So this looks a little daunting. It sprays the VFS with
> > > > knowledge about the shifts and requires support from every
> > > > underlying filesystem.
> >
> > > Well, from my angle, shifts are just user namespace mappings
> > > which follow certain rules, and currently the VFS and all
> > > filesystems are *already* doing some kind of shifting... This
> > > RFC uses mount namespaces, which are the standard way to deal
> > > with mounts; now the mapping inside a mount namespace can just
> > > be "inside: 0:1000" => "outside: 0:1000" and the current
> > > implementation will just use it. At the same time, I'm not sure
> > > this mapping qualifies to be called a "shift". I think that some
> > > folks here came up with the "shift" name to describe one of the
> > > use cases from a user-interface point of view, that's it...
> > > Maybe I should do s/vfs_shift_*/vfs_remap_*/ ?
> >
> > I don't think the naming is the issue ... it's the spread inside
> > the vfs code (and in the underlying fs code). The vfs is very well
>
> Currently the underlying filesystems just parse vfs_shift_uids and
> vfs_shift_gids.
>
> > layered, so touching all that code makes it look like there's a
> > layering problem with the patch. Touching the underlying fs code
> > looks
> >
> Hmm, not sure I follow here? We make use of the mount namespace,
> which is part of the whole layer. Actually, it's the *standard* way
> to control mounts. What do you mean here, please?

The patch touches a lot of the vfs.

> > even more problematic, but that may be necessary if you have a
> > reason for wanting the file ioctls, because they're pass-through
> > and are usually where the from_kuid() calls in filesystems are.
>
> Hmm, sorry, I'm not sure I'm following you here?

An ideal solution, given both our requirements, shouldn't require
touching any underlying fs code.

> > > > A remapping bind filesystem would be a lot simpler and require
> > > > no underlying filesystem support.
> > >
> > > Yes, probably; you still need to parse parameters, but not at
> > > the filesystem level,
> >
> > They'd just be mount options. Basically instead of mount --bind
> > source target, you'd do mount -t uidshift -o <shift options> source
> > target.
> >
> > > and sure, this RFC can do the same of course, but maybe it's not
> > > safe to shift/remap filesystems and their inodes on behalf of the
> > > filesystems... and what about virtual filesystems which can share
> > > inodes?
> >
> > That depends on who you allow to do the shift. Each fstype in the
> > kernel decides access to mount. For the uidshift, I was planning
> > to allow only a capable admin in the initial namespace, meaning
> > that only the admin on the host could set up the shifts. As long
> > as the shifted filesystem is present, the container can then bind
> > it wherever it wants in its mount namespace.
>
> Ah, I see, admin in the initial namespace; yes, that sounds
> reasonable for security reasons. And how will you be able to achieve
> the user namespace shift?

As I said, it would be in the mount options of the command: the <shift
options> above. Probably parametrised the way uid_map and gid_map are
today.
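
Purely as an illustration of that last point, the option could carry
uid_map-style "inside:outside:count" triples and be parsed at mount
time roughly like this (the uidshift fs type, the uidmap= option name
and parse_uidmap() are all hypothetical, not from any posted patch):

#include <linux/errno.h>
#include <linux/kernel.h>

struct shift_map {
	u32 inside;	/* first id as seen by the container */
	u32 outside;	/* first id on the underlying fs */
	u32 count;	/* length of the contiguous id range */
};

/* e.g. mount -t uidshift -o uidmap=0:100000:65536 /src /dst */
static int parse_uidmap(const char *opt, struct shift_map *map)
{
	if (sscanf(opt, "%u:%u:%u",
		   &map->inside, &map->outside, &map->count) != 3)
		return -EINVAL;
	return map->count ? 0 : -EINVAL;
}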

James