Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Thu May 05 2016 - 17:50:21 EST


On Thu, May 05, 2016 at 07:56:28AM -0400, James Bottomley wrote:
> On Thu, 2016-05-05 at 08:36 +0100, Djalal Harouni wrote:
> > On Wed, May 04, 2016 at 05:06:19PM -0400, James Bottomley wrote:
> > > On Wed, 2016-05-04 at 16:26 +0200, Djalal Harouni wrote:
> > > > This is version 2 of the VFS:userns support portable root
> > > > filesystems
> > > > RFC. Changes since version 1:
> > > >
> > > > * Update documentation and remove some ambiguity about the
> > > > feature, based on Josh Triplett's comments.
> > > > * Use a new email address to send the RFC :-)
> > > >
> > > >
> > > > This RFC tries to explore how to support filesystem operations
> > > > inside user namespaces using only the VFS and a per-mount-namespace
> > > > solution. This makes it possible to take advantage of user
> > > > namespace separation without introducing any changes at the
> > > > filesystem level. All of this is handled through the virtual view
> > > > of mount namespaces.
> > > >
> > > >
> > > > 1) Presentation:
> > > > ================
> > > >
> > > > The main aim is to support portable root filesystems and allow
> > > > containers, virtual machines and other use cases to share the same
> > > > root filesystem. For security reasons, filesystems can't be
> > > > mounted inside user namespaces, and mounting them outside will not
> > > > solve the problem since they will show up with the wrong
> > > > UIDs/GIDs; read and write operations will also fail, and so on.
> > > >
> > > > The current userspace solution is to automatically chown the
> > > > whole root filesystem before starting a container, example:
> > > > (host) init_user_ns 1000000:1065536 => (container) user_ns_X1
> > > > 0:65535
> > > > (host) init_user_ns 2000000:2065536 => (container) user_ns_Y1
> > > > 0:65535
> > > > (host) init_user_ns 3000000:3065536 => (container) user_ns_Z1
> > > > 0:65535
> > > > ...
> > > >
> > > > Every time a chown is called, files are changed, and so on...
> > > > This prevents having portable filesystems that you can deploy
> > > > anywhere and boot. Adding an extra step to adapt the filesystem to
> > > > the current mapping and persist it makes it impossible to verify
> > > > its integrity, makes snapshots and migration a bit harder, and
> > > > probably has other limitations...
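To make the per-container chown cost concrete, here is a rough Python sketch of the uid translation implied by the example mappings above (purely illustrative; the helper name is invented):

```python
# Illustrative sketch of the "chown the whole rootfs" approach: each
# container gets its own uid range on the host, so the very same file
# must be re-owned before every container can use the image.

def host_uid(container_uid, host_base, count=65536):
    """Translate a container uid to the host uid it maps to."""
    if not 0 <= container_uid < count:
        raise ValueError("uid not covered by the mapping")
    return host_base + container_uid

# The three example mappings from the text: three different host
# owners for the same container-visible uids 0..65535.
for base in (1000000, 2000000, 3000000):
    print(host_uid(0, base), host_uid(65535, base))
```

The recursive chown over the whole tree has to recompute this for every file, for every new mapping, which is exactly what makes the image non-portable.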
> > > >
> > > > It seems that there are multiple ways to let user namespaces
> > > > combine nicely with filesystems, but none of them is that easy.
> > > > Bind-mounting and pinning the user namespace at mount time will
> > > > not work: bind mounts share the same super block, hence you may
> > > > end up working in the wrong vfsmount context, and there is no easy
> > > > way to get out of that...
> > >
> > > So this option was discussed at the recent LSF/MM summit. The most
> > > supported suggestion was that you'd use a new internal fs type that
> > > had a struct mount with a new superblock and would copy the
> > > underlying inodes but substitute its own with modified ->getattr/
> > > ->setattr calls that did the uid shift. In many ways it would be a
> > > remapping bind which would look similar to overlayfs but be a lot
> > > simpler.
> >
> > Hmm, it's not only about ->getattr and ->setattr, you have all the
> > other file system operations that need access too...
>
> Why? Or perhaps we should more cogently define the actual problem. My
> problem is simply mounting image volumes that were created with real
> uids at user namespace shifted uids because I'm downshifting the
> privileged ids in the container. I actually *only* need the uid/gids
> on the attributes shifted because that's what I need to manipulate the

We obviously need them for read/write/creation...?! We also want to
handle stock filesystems that were never modified, without depending on
any module or third-party solution: mount them outside user namespaces,
and access them from inside.

> volumes. I actually think that other operations, like the file ioctl
> ones should, for security reasons, not be uid shifted. For instance
> with xfs you could set the panic mask and error tags and bring down the
> whole host. What extra things do you need access to and why?

That's precisely why I said that if you handle these options outside
the filesystems, behind their back and on behalf of container managers,
etc., then you are exposed to such scenarios... Some virtual
filesystems can also be mounted by unprivileged users; how will you
deal with something like a bind mount on them?


> > which brings two points:
> >
> > 1) This new internal fs may end up doing what this RFC does...
>
> Well that was why I brought it up, yes.

Yes, but *with* extra code! That was my point. I'm not sure we need to
bother with any *new* internal fs type, nor hack around directory and
file operations... yet that has to be shown, defined, coded...?


> > 2) or, by quoting "new internal fs + its own super block + copy
> > underlying inodes...", it seems like another overlayfs where you
> > also need decisions about what to copy, etc. So, will this really
> > be that lightweight compared to the current overlayfs? Not to
> > mention that you need to hook up basically the same logic, or
> > something else, inside overlayfs...
>
> OK, so forget overlayfs, perhaps that was a bad example. It's like a
> uid shifting bind. The way it works is to use shadow inodes (unlike
> bind, but because you have to intercept the operations, so it's not a
> simple subtree operation) but there's no file copying. The shadow
> points to the real inode.
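A minimal sketch of that shadow-inode idea, in Python purely for illustration (all names are invented; the real thing would be kernel C operating on struct inode):

```python
# Hypothetical sketch of the "uid shifting bind" James describes: a
# shadow inode holds no data of its own; it delegates to the real
# inode and only rewrites the ownership fields on the way through.

class RealInode:
    def __init__(self, uid, gid, size):
        self.uid, self.gid, self.size = uid, gid, size

class ShadowInode:
    def __init__(self, real, uid_offset):
        self.real = real        # points to the real inode, no copy
        self.off = uid_offset   # shift applied to uids/gids

    def getattr(self):
        # Everything except ownership passes straight through.
        return {"uid": self.real.uid - self.off,
                "gid": self.real.gid - self.off,
                "size": self.real.size}

    def setattr(self, uid):
        # The shift is applied in reverse when writing attributes.
        self.real.uid = uid + self.off

sh = ShadowInode(RealInode(uid=1000000, gid=1000000, size=42), 1000000)
print(sh.getattr())  # ownership appears as 0:0 inside the container
```

Note there is no file copying here, only attribute rewriting, which is the distinction being drawn against overlayfs.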

For that you need a super block struct for every mount... now if you
also need a new internal fs + super block + shadow inodes... it seems
like you are going in the overlayfs direction...

I'm taking overlayfs as an example here because it's just nice and
really dead simple!

At the same time, I'm not at all sure about what you are describing,
nor how you will deal with the current mount and bind mount trees and
all the internals...



> > > > Using the user namespace in the super block seems the way to go,
> > > > and there is the "Support fuse mounts in user namespaces" [1]
> > > > patches which seem nice but perhaps too complex!?
> > >
> > > So I don't think that does what you want. The fuse project I've
> > > used before to do uid/gid shifts for build containers is bindfs
> > >
> > > https://github.com/mpartel/bindfs/
> > >
> > > It allows a --map argument where you specify pairs of uids/gids to
> > > map (tedious for large ranges, but the map can be fixed to use
> > > uid:range instead of individual).
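For illustration, here is a toy version of such a pairwise uid map (this is not bindfs's actual parser, just the pair-based remapping idea, and it shows why large ranges are tedious to spell out pair by pair):

```python
# Toy illustration of a pairwise uid map in the style of a
# "src/dst:src/dst" option string. Not bindfs's real parser.

def parse_map(spec):
    """Parse "1000/0:1001/1" into {1000: 0, 1001: 1}."""
    table = {}
    for pair in spec.split(":"):
        src, dst = pair.split("/")
        table[int(src)] = int(dst)
    return table

def remap(uid, table):
    return table.get(uid, uid)  # unmapped uids pass through unchanged

m = parse_map("1000/0:1001/1")
print(remap(1000, m), remap(1001, m), remap(65534, m))  # 0 1 65534
```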
> >
> > Ok, thanks for the link, I will try to take a deep look, but bindfs
> > seems really big!
>
> Well, it does a lot more than just uid/gid shift.
>
> > > > there is also the overlayfs solution, and finally the VFS layer
> > > > solution.
> > > >
> > > > We present here a simple VFS solution where everything is packed
> > > > inside the VFS; filesystems don't need to know anything (except
> > > > probably XFS, and special operations inside union filesystems).
> > > > Currently it supports ext4, btrfs and overlayfs. Changes to
> > > > filesystems are small: just parse the vfs_shift_uids and
> > > > vfs_shift_gids options during mount and set the appropriate flags
> > > > in the super_block structure.
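A sketch of how small that per-filesystem change could be, simulated in Python (flag names and values are invented for illustration; the real code would set bits on struct super_block during mount):

```python
# Sketch of the per-filesystem change the RFC describes: during mount,
# recognize the vfs_shift_uids/vfs_shift_gids options and record them
# as flags. Flag names/values here are invented for illustration.

SB_VFS_SHIFT_UIDS = 1 << 0
SB_VFS_SHIFT_GIDS = 1 << 1

def parse_mount_opts(opts):
    flags = 0
    for opt in opts.split(","):
        if opt == "vfs_shift_uids":
            flags |= SB_VFS_SHIFT_UIDS
        elif opt == "vfs_shift_gids":
            flags |= SB_VFS_SHIFT_GIDS
        # any other option is left for the filesystem's usual parsing
    return flags

print(parse_mount_opts("rw,vfs_shift_uids,vfs_shift_gids"))  # 3
```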
> > >
> > > So this looks a little daunting. It sprays the VFS with knowledge
> > > about the shifts and requires support from every underlying
> > > filesystem.
>
> > Well, from my angle, shifts are just user namespace mappings which
> > follow certain rules, and currently the VFS and all filesystems are
> > *already* doing some kind of shifting... This RFC uses mount
> > namespaces, which are the standard way to deal with mounts. Now the
> > mapping inside a mount namespace can just be "inside: 0:1000" =>
> > "outside: 0:1000" and the current implementation will just use it;
> > at the same time I'm not sure this mapping qualifies to be called a
> > "shift". I think some folks here came up with the "shift" name to
> > describe one of the use cases from a user-interface angle, that's
> > it... Maybe I should do s/vfs_shift_*/vfs_remap_*/ ?
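The mapping semantics can be sketched in the spirit of the kernel's uid_map extents ("inside outside count" triples); a minimal Python simulation, assuming that format (the function name is invented, not the kernel's):

```python
# Simulation of a user-namespace id mapping made of extents, in the
# spirit of the /proc/<pid>/uid_map format: "inside outside count".
# Mirrors the "inside: 0:1000 => outside: 0:1000" identity example.

def map_id_outside(extents, inside_id):
    """Translate an in-namespace id to the outside (host) id."""
    for inside, outside, count in extents:
        if inside <= inside_id < inside + count:
            return outside + (inside_id - inside)
    return None  # the kernel would use the overflow uid here

identity = [(0, 0, 1000)]          # inside 0:1000 => outside 0:1000
shifted  = [(0, 1000000, 65536)]   # a typical container mapping
print(map_id_outside(identity, 500), map_id_outside(shifted, 0))  # 500 1000000
```

Under the identity extent the "shift" degenerates into a no-op, which is the point being made: it is one mapping mechanism, "shift" is just one use of it.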
>
> I don't think the naming is the issue ... it's the spread inside the
> vfs code (and in the underlying fs code). The vfs is very well

Currently the underlying filesystems just parse vfs_shift_uids and
vfs_shift_gids.


> layered, so touching all that code makes it look like there's a
> layering problem with the patch. Touching the underlying fs code looks

Hmm, not sure I follow here? We make use of the mount namespace, which
is part of the whole layering. Actually, it's the *standard* way to
control mounts. What do you mean here, please?


> even more problematic, but that may be necessary if you have a reason
> for wanting the file ioctls, because they're pass through and usually
> where the from_kuid() calls are in filesystems.

Hmm, sorry, I'm not sure I'm following you here?


> > > A simple remapping bind filesystem would be a lot simpler and
> > > require no underlying filesystem support.
> >
> > Yes probably, you still need to parse parameters but not at the
> > filesystem level,
>
> They'd just be mount options. Basically instead of mount --bind source
> target, you'd do mount -t uidshift -o <shift options> source target.
>
> > and sure, this RFC can do the same of course, but maybe it's not
> > safe to shift/remap filesystems and their inodes on behalf of the
> > filesystems... and what about virtual filesystems, which can share
> > inodes?
>
> That depends who you allow to do the shift. Each fstype in the kernel
> decides access to mount. For the uidshift, I was planning to allow
> only a capable admin in the initial namespace, meaning that only the
> admin in the host could set up the shifts. As long as the shifted
> filesystem is present, the container can then bind it wherever it wants
> in its mount namespace.

Ah, I see, an admin in the initial namespace; yes, that sounds
reasonable for security reasons. And how will you be able to achieve
the user namespace shift?


> James
>

Thank you!


--
Djalal Harouni
http://opendz.org