Re: [RFC PATCH 0/0] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Wed May 04 2016 - 06:09:05 EST

Hi Josh,

Thanks for the reply! I'll resend this RFC soon: it seems that I can
receive mail from the mailing lists but can't send to them, probably
because of some filters. My domain was dead for a short period.

On Tue, May 03, 2016 at 05:41:07PM -0700, Josh Triplett wrote:
> On Wed, May 04, 2016 at 01:21:46AM +0200, Djalal Harouni wrote:
> > This RFC tries to explore how to support filesystem operations inside
> > user namespace using only VFS and a per mount namespace solution. This
> > allows to take advantage of user namespace separations without
> > introducing any change at the filesystems level. All this is handled
> > with the virtual view of mount namespaces.
> >
> > 1) Presentation:
> > ================
> >
> > The main aim is to support portable root filesystems and allow containers,
> > virtual machines and other cases to use the same root filesystem.
> > Due to security reasons, filesystems can't be mounted inside user
> > namespaces, and mounting them outside will not solve the problem since
> > they will show up with the wrong UIDs/GIDs. Read and write operations
> > will also fail and so on.
> [...]
> > Using the user namespace in the super block seems the way to go, and
> > there is the "Support fuse mounts in user namespaces" [1] patches which
> > seem nice but perhaps too complex!? There is also the overlayfs solution,
> > and finally the VFS layer solution.
> >
> >
> > We present here a simple VFS solution, everything is packed inside VFS,
> > filesystems don't need to know anything (except probably XFS, and special
> > operations inside union filesystems). Currently it supports ext4, btrfs
> > and overlayfs. Changes into filesystems are small, just parse the
> > vfs_shift_uids and vfs_shift_gids options during mount and set the
> > appropriate flags into the super_block structure.
>
> Interesting idea, and I certainly like the approach of addressing this
> by mapping UIDs/GIDs within VFS. However, I see a few issues with this:
>
> > 2) The solution is based on VFS and mount namespaces, we use the user
> > namespace of the containing mount namespace to check if we should shift
> > UIDs/GIDs from/to virtual <=> on-disk view.
> > If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
> > options, and if it shows up inside a mount namespace that supports VFS
> > UIDs/GIDs shifts then during each access we will remap UID/GID either
> > to virtual or to on-disk view using simple helper functions to allow the
> > access. In case the mount or current mount namespace do not support VFS
> > UID/GID shifts, we fallback to the old behaviour, no shift is performed.
> >
> > 3) inodes will always keep their original values which reflect the
> > mapping inside init_user_ns which we consider the on-disk mapping.
> > Therefore they will have a mapping from 0:65536 on-disk, these values are
> > the persistent values that we have to write to the disk. We don't keep
> > track of any UID/GID shift that was applied before. This gives
> > portability and allows to use the previous mapping which was freed for
> > another root filesystem...
>
> What about filesystems that support 32-bit UIDs/GIDs on disk, which
> includes most modern filesystems, including ext4?

That 0:65536 was just an example; we are not hardcoding anything here. We
just use the mapping provided by user namespaces, which supports 32-bit IDs.

That plan was just an example where containers set up the lower
16 bits for UID/GID shifts inside the container. This way, they can
have the same "virtual" mapping inside the container and use the
rootfs, while outside they could use the upper 16 bits for separation,
as a container ID, or whatever.
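
To make that example layout concrete, here is a minimal userspace sketch. It
is only an illustration of the "lower 16 bits inside, upper 16 bits as a
container ID" idea and is not code from the patches; the helper names
(shifted_uid, uid_in_container_of, container_id_of) are hypothetical.

```c
#include <stdint.h>

/*
 * Hypothetical layout, assumed for illustration only:
 *   upper 16 bits = container ID chosen by the host
 *   lower 16 bits = UID as seen inside the container
 */
static uint32_t shifted_uid(uint16_t container_id, uint16_t uid_in_container)
{
	return ((uint32_t)container_id << 16) | uid_in_container;
}

/* Recover the in-container UID from the host-side value. */
static uint16_t uid_in_container_of(uint32_t host_uid)
{
	return (uint16_t)(host_uid & 0xffffu);
}

/* Recover the container ID from the host-side value. */
static uint16_t container_id_of(uint32_t host_uid)
{
	return (uint16_t)(host_uid >> 16);
}
```

With this layout, UID 1000 in container 3 would be 3 * 65536 + 1000 = 197608
on the host side, and root in every container keeps the same virtual value 0.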

> What about nesting, which seems like a perfectly legitimate thing to do?
> (For instance, you want to use a container to run the equivalent of a
> distribution-in-a-chroot, and that distribution internally uses
> containers/namespaces for its own purposes, such as running daemons with
> lower privileges.) The "shift" approach can't support that, because you
> can't give the namespace in the middle permission to "shift" to
> UIDs/GIDs it doesn't control.

I'll confirm this later and respond.

> I do very much like the idea of remapping UIDs/GIDs within VFS.
> However, I'm wondering if it would work better to provide a
> uidmap/gidmap, similar to that used for userns today. In simple cases,
> that should be approximately as efficient as the approach in this patch
> series (if it maps a range of UIDs inside to UIDs outside, it would
> effectively be a bounds-check, an add, and a fallback value for unmapped
> IDs). But you could then nest it the same way you can nest uidmap: as
> root in a namespace, you can map any UID you yourself were given.
>
> This would also be useful for non-container applications: for instance,
> you could mount a USB disk with an ext4 filesystem and not assume the
> UIDs match those of your host system, while also not squashing them all
> to a single UID the way a uid= option would.
>
> As with this series, a mapping approach wouldn't require allowing mounts
> inside the namespace. You *could* mount from within the namespace if
> you want and you have appropriate access to do so, mapping UIDs/GIDs on
> the disk to UIDs/GIDs you have in your namespace. However, you could
> also do the mount with mapping from outside the namespace, to non-root
> UIDs/GIDs, and then use those same UIDs/GIDs in a userns mapping to map
> them to root.
>
> The main design constraint with a full mapping would be passing that
> through "mount". There have been discussions on and off for years about
> replacing the mount() system call with something either two-phase (get
> filesystem driver FD, send it a series of parameters ending with mount;
> the VFS would interpret many of those parameters) or three-phase (get
> filesystem driver FD, send it parameters ending with getting a directory
> FD, bind the directory FD). Given an interface like that, providing a
> UID/GID map at mount time seems plausible.

Could you please provide some links to these discussions?

I'll get back to it.

> Alternatively, a much simpler approach that could potentially be
> expanded in the future would be to add *two* parameters each for UID and
> GID: a base and a max. That would define a range, which doesn't
> necessarily need to be exactly 2**16; thus, if you had a big enough
> range, that approach would nest as well.

Hm, I see, but I'm not sure it makes sense, since it would hardcode the
mapping at mount time, whereas that mount may later be used with another
mapping configuration. I think we should just take a user namespace
reference and that's it. That way the current user namespace interface
does the job for us, and as said above the 2**16 is just an example.
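
For reference, the per-range translation discussed above (a bounds check, an
add, and a fallback value for unmapped IDs) could look roughly like the
sketch below. This is a standalone illustration, not the helpers from the
series: struct id_range and the map_to_* functions are hypothetical names,
and the fallback of 65534 is an assumption borrowed from the kernel's default
overflow UID.

```c
#include <stdint.h>

/* Assumed fallback for unmapped IDs (matches the default overflow UID). */
#define FALLBACK_ID 65534u

/* One contiguous range: [inside_first, inside_first + count) maps to
 * [outside_first, outside_first + count). Hypothetical type. */
struct id_range {
	uint32_t inside_first;
	uint32_t outside_first;
	uint32_t count;
};

/* Virtual (inside) view to on-disk (outside) view. */
static uint32_t map_to_outside(const struct id_range *r, uint32_t inside)
{
	if (inside < r->inside_first || inside - r->inside_first >= r->count)
		return FALLBACK_ID;	/* not covered by the range */
	return r->outside_first + (inside - r->inside_first);
}

/* On-disk (outside) view back to virtual (inside) view. */
static uint32_t map_to_inside(const struct id_range *r, uint32_t outside)
{
	if (outside < r->outside_first || outside - r->outside_first >= r->count)
		return FALLBACK_ID;	/* not covered by the range */
	return r->inside_first + (outside - r->outside_first);
}
```

The arithmetic is the same whether the range comes from mount options or
from a user namespace's uid_map/gid_map; the difference is only where the
range is stored and who is allowed to change it.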

> - Josh Triplett


Djalal Harouni