Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Dave Chinner
Date: Thu May 05 2016 - 22:50:47 EST


On Thu, May 05, 2016 at 11:24:35PM +0100, Djalal Harouni wrote:
> On Thu, May 05, 2016 at 10:23:14AM +1000, Dave Chinner wrote:
> > On Wed, May 04, 2016 at 04:26:46PM +0200, Djalal Harouni wrote:
> > > This is version 2 of the VFS:userns support portable root filesystems
> > > RFC. Changes since version 1:
> > >
> > > * Update documentation and remove some ambiguity about the feature.
> > > Based on Josh Triplett comments.
> > > * Use a new email address to send the RFC :-)
> > >
> > >
> > > This RFC tries to explore how to support filesystem operations inside
> > > user namespace using only VFS and a per mount namespace solution. This
> > > allows to take advantage of user namespace separations without
> > > introducing any change at the filesystems level. All this is handled
> > > with the virtual view of mount namespaces.
> >
> > [...]
> >
> > > As an example if the mapping 0:65535 inside mount namespace and outside
> > > is 1000000:1065536, then 0:65535 will be the range that we use to
> > > construct UIDs/GIDs mapping into init_user_ns and use it for on-disk
> > > data. They represent the persistent values that we want to write to the
> > > disk. Therefore, we don't keep track of any UID/GID shift that was applied
> > > before, it gives portability and allows to use the previous mapping
> > > which was freed for another root filesystem...
> >
> > So let me get this straight. Two /isolated/ containers, different
> > UID/GID mappings, sharing the same files and directories. Create a
> > new file in a writeable directory in container 1, namespace
> > information gets stripped from on-disk uid/gid representation.
> >
> > Container 2 then reads that shared directory, finds the file written
> > by container 1. As there is no namespace component to the uid:gid
> > stored in the inode, we apply the current namespace shift to the VFS
> > inode uid/gid and so it maps to root in container 2 and we are
> > allowed to read it?
>
> Only if container 2 has the flag CLONE_MNTNS_SHIFT_UIDGID set in its own
> mount namespace which only root can set or if it was already set in
> parent, and have access to the shared dir which the container manager
> should also configure before... even with that if in container 2 the
> shift flag is not set then there is no mapping and things work as they
> are now, but yes this setup is flawed! they should not share rootfs,
> maybe in rare cases, some user data that's it.

<head explodes>

I can't follow any of the logic you're explaining - you just
confused me even more. I thought this was to allow namespaces with
different uid/gid mappings all to use the same backing store? And
now you're saying that "no, they'll all have separate backing
stores"?

I suspect you need to describe the layering in a way a stupid dummy
can understand, because trying to be clever with wacky examples is
not working.

> > Unless I've misunderstood something in this crazy mapping scheme,
> > isn't this just a vector for unintentional containment breaches?
> >
> > [...]
> >
> > > Simple demo overlayfs, and btrfs mounted with vfs_shift_uids and
> > > vfs_shift_gids. The overlayfs mounts will share the same upperdir. We
> > > create two user namespaces every one with its own mapping and where
> > > container-uid-2000000 will pull changes from container-uid-1000000
> > > upperdir automatically.
> >
> > Ok, forget I asked - it's clearly intentional. This is beyond
> > crazy, IMO.
>
> This setup is flawed! that example was to show that files show up with
> the right mapping with two different user namespaces. As Andy noted
> they should have a backing device...

Did you mean "should have different backing devices" here? If not,
I'm even more confused now...

> Anyway by the previous paragraph what I mean is that when the container
> terminates it releases the UID shift range which can be re-used later
> on another filesystem or on the same previous fs... whatever. Now if
> the range is already in use, userspace should grab a new range give it
> a new filesystem or a previous one which doesn't need to be shared and
> everything should continue to work...

This sounds like you're talking about a set of single, sequential
uses of a single filesystem image across multiple different
container lifecycles? Maybe that's where I'm getting confused,
because I'm assuming multiple concurrent uses of a single filesystem
by all the running containers that are running the same distro
image....

> simple example with loop devices..., however the image should be a GPT
> (GUID partition table) or an MBR one...
>
> $ dd if=/dev/zero of=/tmp/fedora-newtree.raw bs=10M count=100
> $ mkfs.ext4 /tmp/fedora-newtree.raw
> ...
> $ sudo mount -t ext4 -oloop,rw,sync /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
> $ sudo yum -y --releasever=23 --installroot=/mnt/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim
> $ sudo mount -t ext4 -oloop,vfs_shift_uids,vfs_shift_gids, /var/lib/machines/fedora-newtree.raw /mnt/fedora-tree
> $ sudo ~/container --uidmap [1000000:1065536 or
> 2000000:2065536 or
> 3000000:3065536 ....]
> (That's the mapping outside of the container)

This doesn't match your above comments about separate backing
stores. Here we have two mounts sharing the same image file, both
mounted read/write - there's no separate backing store here. The
fact you hide the initial mount that was populated by yum by
overmounting the same mount point doesn't stop the original mount
from modifying the image file independently of the container you
started.

I'm getting the impression that there's a missing step in all your
examples here - that you create a writable snapshot or overlay of
the original fs image to create separate backing devices for each
container. In that case, the uid/gid shifting avoids needing to make
uid/gid modifications to the snapshot/overlay to match the
container's mapped uid/gids.
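
To make the arithmetic of that case concrete, here is a minimal
userspace sketch in Python (not kernel code, and not taken from the
patch set). It assumes each container gets a 65536-wide range starting
at its own base, and that the snapshot keeps unshifted IDs on disk.
The names RANGE, disk_to_host() and host_to_disk() are made up for
illustration.

```python
RANGE = 65536  # assumed width of each container's ID range


def disk_to_host(disk_id, base):
    """Shift an on-disk ID into the container's host-side range."""
    if not 0 <= disk_id < RANGE:
        raise ValueError("on-disk ID outside the shifted range")
    return base + disk_id


def host_to_disk(host_id, base):
    """Strip the container's shift before the ID goes back to disk."""
    if not base <= host_id < base + RANGE:
        raise ValueError("host ID not owned by this container")
    return host_id - base


# The same root-owned file in two per-container snapshots:
snapshot_owner = 0
seen_by_a = disk_to_host(snapshot_owner, 1000000)
seen_by_b = disk_to_host(snapshot_owner, 2000000)
```

With separate snapshots, neither container can observe the other's
writes, so the fact that both strip their shift to the same on-disk
value does not matter.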

Similarly, if the use case given was read-only sharing of trees
between containers, there's no need for separate snapshots or
overlays, just a bunch of read-only (bind?) mounts with shifts
specified for the intended container.

These seem like pretty sane use cases for wanting to shift
uids/gids in this manner, but if that's the case then I'm struggling
to understand where the complexity in the description is coming
from.

> > > 3) ROADMAP:
> > > ===========
> > > * Confirm current design, and make sure that the mapping is done
> > > correctly.
> >
> > How are you going to ensure that all filesystems behave the same,
> > and it doesn't get broken by people who really don't care about this
> > sort of crazy?
>
> By trying to make this a VFS mount namespace parameter. So if the
> shift is not set on the mount namespace then we just fall back to
> the current behaviour! no shift is performed.

That wasn't what I was asking - I was asking a code maintenance
question. i.e. someone will come along who doesn't quite understand
WTF all this convoluted namespace ID mapping is doing and they will
accidentally break it in a subtle way that nobody notices because they
didn't directly change anything to do with ID shifting. What's the
plan for preventing that from happening?

> later of course I'll try xfstests and several tests...
>
> Does this answer your question ?

That's closer, but ambiguous. ;) Were you planning on just running
some existing tests or writing a set of regression tests that
explicitly encode expected usage and behaviour, as well as what is
expected to fail?

> > .....
> > > * Add XFS support.
> >
> > What is the problem here?
>
> Yep, sorry! just lack of time from my part! XFS currently is a bit aware
> of kuid/kgid mapping on its own, and I just didn't have the appropriate
> time! Will try to fix it next time.

You'd be talking about the xfs_kuid_to_uid/xfs_uid_to_kuid()
wrappers, right?

It comes down to the kuid/kgid being kernel internal representations of
an ID, not an on-disk format representation. Like all other kernel
internal types they can change size and structure at any time, while
the persistent on-disk format cannot change without lots of hassle
(and then we really need conversion functions!). For clean layering,
abstraction and self-documenting code, internal types are always
converted to/from a persistent, on-disk format representation in
this manner.
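
The layering can be modelled in a few lines of userspace Python. This
is only a sketch of the idea, not the XFS wrappers themselves: the
KUid class, the function names and the 32-bit big-endian layout are
all assumptions chosen for illustration.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class KUid:
    """Stand-in for an internal ID type whose layout may change freely."""
    val: int


# Assumed persistent format for this sketch: 32-bit big-endian.
DISK_UID = struct.Struct(">I")


def kuid_to_disk(kuid):
    """Convert the internal type to the fixed persistent representation."""
    return DISK_UID.pack(kuid.val)


def disk_to_kuid(raw):
    """Convert persistent bytes back into the internal type."""
    (val,) = DISK_UID.unpack(raw)
    return KUid(val)
```

Only these two functions know both representations, so a change to
the internal type touches the wrappers and nothing else.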

> > Next question: how does this work with uid/gid based quotas?
>
> If you do a shift you should know that you will share quota on
> disk.

Yes, and this means you can't account for individual container space
usage on such mapped devices. Also, don't you need to shift
uids/gids for the quota syscalls like you do elsewhere?

I also wonder about the fact that the quota interfaces are likely to
return uids/gids that may not exist in a given container...
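
A rough Python model of the accounting problem, assuming quota is
charged against the on-disk uid after the shift has been stripped.
The charge() and report() helpers and the two bases are hypothetical,
not anything from the patch set or the quota code.

```python
from collections import Counter


def charge(usage, host_uid, base, nbytes):
    """Account space against the on-disk uid, i.e. with the shift stripped."""
    usage[host_uid - base] += nbytes


def report(usage, base):
    """Shift on-disk uids back for a container reading the quota report."""
    return {base + uid: used for uid, used in usage.items()}


usage = Counter()
charge(usage, 1001000, 1000000, 4096)  # uid 1000 in container A
charge(usage, 2001000, 2000000, 8192)  # uid 1000 in container B
```

Both writes land in the single on-disk bucket for uid 1000, so the
space used by each container cannot be told apart, and any report has
to be shifted again before it means anything inside a container.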

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx