Re: [RFC PATCH 0/0] VFS:userns: support portable root filesystems

From: Josh Triplett
Date: Tue May 03 2016 - 20:41:31 EST

On Wed, May 04, 2016 at 01:21:46AM +0200, Djalal Harouni wrote:
> This RFC tries to explore how to support filesystem operations inside
> user namespace using only VFS and a per mount namespace solution. This
> allows to take advantage of user namespace separations without
> introducing any change at the filesystems level. All this is handled
> with the virtual view of mount namespaces.
> 1) Presentation:
> ================
> The main aim is to support portable root filesystems and allow containers,
> virtual machines and other cases to use the same root filesystem.
> Due to security reasons, filesystems can't be mounted inside user
> namespaces, and mounting them outside will not solve the problem since
> they will show up with the wrong UIDs/GIDs. Read and write operations
> will also fail and so on.
> Using the user namespace in the super block seems the way to go, and
> there is the "Support fuse mounts in user namespaces" [1] patches which
> seem nice but perhaps too complex!? There is also the overlayfs solution,
> and finally the VFS layer solution.
> We present here a simple VFS solution, everything is packed inside VFS,
> filesystems don't need to know anything (except probably XFS, and special
> operations inside union filesystems). Currently it supports ext4, btrfs
> and overlayfs. Changes into filesystems are small, just parse the
> vfs_shift_uids and vfs_shift_gids options during mount and set the
> appropriate flags into the super_block structure.

Interesting idea, and I certainly like the approach of addressing this
by mapping UIDs/GIDs within VFS. However, I see a few issues with this:

> 2) The solution is based on VFS and mount namespaces, we use the user
> namespace of the containing mount namespace to check if we should shift
> UIDs/GIDs from/to virtual <=> on-disk view.
> If a filesystem was mounted with "vfs_shift_uids" and "vfs_shift_gids"
> options, and if it shows up inside a mount namespace that supports VFS
> UIDs/GIDs shifts then during each access we will remap UID/GID either
> to virtual or to on-disk view using simple helper functions to allow the
> access. In case the mount or current mount namespace do not support VFS
> UID/GID shifts, we fall back to the old behaviour, no shift is performed.
> 3) inodes will always keep their original values which reflect the
> mapping inside init_user_ns which we consider the on-disk mapping.
> Therefore they will have a mapping from 0:65536 on-disk, these values are
> the persistent values that we have to write to the disk. We don't keep
> track of any UID/GID shift that was applied before. This gives
> portability and allows to use the previous mapping which was freed for
> another root filesystem...

What about filesystems that support 32-bit UIDs/GIDs on disk? That
covers most modern filesystems, including ext4.

What about nesting, which seems like a perfectly legitimate thing to do?
(For instance, you want to use a container to run the equivalent of a
distribution-in-a-chroot, and that distribution internally uses
containers/namespaces for its own purposes, such as running daemons with
lower privileges.) The "shift" approach can't support that, because you
can't give the namespace in the middle permission to "shift" to
UIDs/GIDs it doesn't control.

I do very much like the idea of remapping UIDs/GIDs within VFS.
However, I'm wondering if it would work better to provide a
uidmap/gidmap, similar to that used for userns today. In simple cases,
that should be approximately as efficient as the approach in this patch
series (if it maps a range of UIDs inside to UIDs outside, it would
effectively be a bounds-check, an add, and a fallback value for unmapped
IDs). But you could then nest it the same way you can nest uidmap: as
root in a namespace, you can map any UID you yourself were given.
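
To make the "bounds-check, an add, and a fallback" cost concrete, here
is a minimal userspace sketch of a range-based lookup. Every name in it
(struct id_extent, map_id_down, ID_UNMAPPED) is hypothetical and not
taken from the patch series or from the kernel; the fallback value is an
assumption, and extents are assumed not to wrap past 2**32.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed fallback for IDs with no mapping (hypothetical value). */
#define ID_UNMAPPED 65534u

/* One contiguous range: 'count' IDs starting at 'inside' map to the
 * IDs starting at 'outside'. Hypothetical layout, not a kernel type. */
struct id_extent {
	uint32_t inside;
	uint32_t outside;
	uint32_t count;
};

/* Map an inside ID to an outside ID using the first matching extent. */
static uint32_t map_id_down(const struct id_extent *map, size_t n,
			    uint32_t id)
{
	for (size_t i = 0; i < n; i++) {
		/* Unsigned subtraction wraps to a large value when
		 * id < inside, so one comparison checks both bounds. */
		uint32_t off = id - map[i].inside;

		if (off < map[i].count)
			return map[i].outside + off;	/* the add */
	}
	return ID_UNMAPPED;				/* the fallback */
}
```

With a single extent { 0, 100000, 65536 }, ID 1000 maps to 101000 and
ID 65536 falls through to the fallback.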

This would also be useful for non-container applications: for instance,
you could mount a USB disk with an ext4 filesystem and not assume the
UIDs match those of your host system, while also not squashing them all
to a single UID the way a uid= option would.

As with this series, a mapping approach wouldn't require allowing mounts
inside the namespace. You *could* mount from within the namespace if
you want and you have appropriate access to do so, mapping UIDs/GIDs on
the disk to UIDs/GIDs you have in your namespace. However, you could
also do the mount with mapping from outside the namespace, to non-root
UIDs/GIDs, and then use those same UIDs/GIDs in a userns mapping to map
them to root.
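
A rough sketch of that second arrangement, assuming one contiguous
range per map: the mount-time map takes on-disk IDs to non-root host
IDs, and the userns map takes those same host IDs to IDs inside the
namespace. The names (struct one_range, apply_range, disk_to_ns) are
made up for illustration and do not correspond to any existing
interface.

```c
#include <stdint.h>

/* Hypothetical single-range map: 'count' IDs starting at 'from' map to
 * the IDs starting at 'to'. Not a kernel structure. */
struct one_range {
	uint32_t from;
	uint32_t to;
	uint32_t count;
};

/* Returns 0 and stores the mapped ID, or -1 if 'id' is out of range. */
static int apply_range(const struct one_range *r, uint32_t id,
		       uint32_t *out)
{
	uint32_t off = id - r->from;	/* wraps if id < from */

	if (off >= r->count)
		return -1;
	*out = r->to + off;
	return 0;
}

/* On-disk ID -> ID seen inside the namespace, via the host ID. */
static int disk_to_ns(const struct one_range *mount_map,
		      const struct one_range *userns_map,
		      uint32_t disk_id, uint32_t *ns_id)
{
	uint32_t host_id;

	if (apply_range(mount_map, disk_id, &host_id) != 0)
		return -1;
	return apply_range(userns_map, host_id, ns_id);
}
```

For example, with a mount map of { 0, 100000, 65536 } and a userns map
of { 100000, 0, 65536 }, a file owned by on-disk UID 0 is owned by host
UID 100000, which the namespace sees as UID 0.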

The main design constraint with a full mapping would be passing that
through "mount". There have been discussions on and off for years about
replacing the mount() system call with something either two-phase (get
filesystem driver FD, send it a series of parameters ending with mount;
the VFS would interpret many of those parameters) or three-phase (get
filesystem driver FD, send it parameters ending with getting a directory
FD, bind the directory FD). Given an interface like that, providing a
UID/GID map at mount time seems plausible.

Alternatively, a much simpler approach that could potentially be
expanded in the future would be to add *two* parameters each for UID and
GID: a base and a max. That would define a range, which doesn't
necessarily need to be exactly 2**16; thus, if you had a big enough
range, that approach would nest as well.
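
As an illustration only, here is one way the base/max pair could
behave. The semantics are my assumption, since nothing above pins them
down: on-disk ID d shifts to base + d as long as the result does not
exceed max (inclusive), and a nested range is acceptable only if it
lies entirely within its parent. struct shift_range, shift_id and
range_nests are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical mount parameters: an inclusive range [base, max]. */
struct shift_range {
	uint32_t base;
	uint32_t max;
};

/* Shift an on-disk ID into the range. Returns 0 on success, or -1 if
 * the range is empty or the shifted ID would exceed 'max'. */
static int shift_id(const struct shift_range *r, uint32_t disk_id,
		    uint32_t *out)
{
	if (r->max < r->base || disk_id > r->max - r->base)
		return -1;
	*out = r->base + disk_id;
	return 0;
}

/* Nonzero if 'child' is a valid range lying entirely inside 'parent'. */
static int range_nests(const struct shift_range *parent,
		       const struct shift_range *child)
{
	return child->base <= child->max &&
	       child->base >= parent->base &&
	       child->max <= parent->max;
}
```

A parent range of [100000, 231071] covers 2**17 IDs, so it has room to
hand the upper 2**16 of them, [165536, 231071], to a nested container.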

- Josh Triplett