Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Thu May 05 2016 - 18:35:27 EST

On Wed, May 04, 2016 at 06:44:14PM -0700, Andy Lutomirski wrote:
> On Wed, May 4, 2016 at 5:23 PM, Dave Chinner <david@xxxxxxxxxxxxx> wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> >> This is version 2 of the VFS:userns support portable root filesystems
> >> RFC. Changes since version 1:
> >>
> >> * Update documentation and remove some ambiguity about the feature.
> >> Based on Josh Triplett comments.
> >> * Use a new email address to send the RFC :-)
> >>
> >>
> >> This RFC tries to explore how to support filesystem operations inside
> >> user namespace using only VFS and a per mount namespace solution. This
> >> allows to take advantage of user namespace separations without
> >> introducing any change at the filesystems level. All this is handled
> >> with the virtual view of mount namespaces.
> >
> > [...]
> >
> >> As an example if the mapping 0:65535 inside mount namespace and outside
> >> is 1000000:1065536, then 0:65535 will be the range that we use to
> >> construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> >> data. They represent the persistent values that we want to write to the
> >> disk. Therefore, we don't keep track of any UID/GID shift that was applied
> >> before, it gives portability and allows to use the previous mapping
> >> which was freed for another root filesystem...
> >
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> I think the intent is a totally separate superblock for each
> container. Djalal, am I right?

Absolutely, that would be ideal: each container mounts its image
device inside its new mount namespace with the right private/slave
flags, so nothing propagates to the host. Whether the backing device is
GPT, LVM, loop or anything else, the mount shows up only inside the
container.

Now, as you know, we can't prevent every flawed setup. What I did make
sure of is that the CLONE_MNTNS_SHIFT_UIDGID flag can only be set by
real root.

> The feature that seems to me to be missing is the ability to squash
> uids. I can imagine desktop distros wanting to mount removable
> storage such that everything shows up (to permission checks and
> stat()) as the logged-in user's uid but that the filesystem sees 0:0.
> That can be done by shifting, but the distro would want everything
> else on the filesystem to show up as the logged-in user as well.
> That use case could also be handled by adding a way to tell a given
> filesystem to completely opt out of normal access control rules and
> just let a given user act as root wrt that filesystem (and be nosuid,
> of course). This would be a much greater departure from current
> behavior, but would let normal users chown things on a removable
> device, which is potentially nice.

Ok Andy, this one is hard... I gave it some thought; what do you
think of the following?
It will work only if you are referring to some high-level software
in distros, which of course seems perfect for normal users.

So the software should do:

1) mount the removable storage with vfs_shift_uids and vfs_shift_gids
2) Now the software should act as a container, make a

=> Set up the right mapping so we are able to access files...

The mount will show up in the new mount namespace.

3) Now, inside the new namespaces, we are able to access all files.

4) Take the values returned by stat() and shift them back to the
logged-in user.

The software set up the mapping itself, so it already knows who maps to
whom!

This allows stat() results to be shown as if they belonged to the normal
logged-in user, and everything works as you have described. So maybe
this has its place in a small userspace helper library that all such
software can use?! Thoughts?
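
To make step 4 a bit more concrete, here is a rough userspace sketch of
the "shift back" part only. It is not code from the patch set: the names
IdRange, shift_back and stat_as_user are made up for illustration, and
it assumes the helper keeps the mapping it set up as a list of
(first ID inside the namespace, first ID outside, count) ranges, in the
same spirit as the entries of a uid_map file.

```python
import os
from typing import NamedTuple, Optional, Sequence, Tuple


class IdRange(NamedTuple):
    """One mapping range kept by the (hypothetical) helper."""
    ns_first: int    # first ID as seen inside the new namespaces
    host_first: int  # first ID it corresponds to outside
    count: int       # number of consecutive IDs covered


def shift_back(ranges: Sequence[IdRange], ns_id: int) -> Optional[int]:
    """Translate an ID seen inside the namespace to the outside ID.

    Returns None when the ID is not covered by any range.
    """
    for r in ranges:
        offset = ns_id - r.ns_first
        if 0 <= offset < r.count:
            return r.host_first + offset
    return None


def stat_as_user(path: str,
                 uid_ranges: Sequence[IdRange],
                 gid_ranges: Sequence[IdRange]
                 ) -> Tuple[Optional[int], Optional[int]]:
    """stat() a path and report its owner shifted back through the mapping."""
    st = os.stat(path)
    return (shift_back(uid_ranges, st.st_uid),
            shift_back(gid_ranges, st.st_gid))
```

For example, if the helper mapped only ID 0 inside the namespaces to a
logged-in user with uid 1000 (an assumed value), then
shift_back([IdRange(0, 1000, 1)], 0) gives 1000, and any ID outside the
range gives None so the caller can decide how to display it.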

> --Andy


Djalal Harouni