[RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Wed May 04 2016 - 10:47:31 EST

This is version 2 of the VFS:userns support portable root filesystems
RFC. Changes since version 1:

* Update documentation and remove some ambiguity about the feature.
Based on Josh Triplett's comments.
* Use a new email address to send the RFC :-)

This RFC explores how to support filesystem operations inside user
namespaces using only the VFS and a per mount namespace solution. This
makes it possible to take advantage of user namespace separation without
introducing any change at the filesystem level. All of this is handled
through the virtual view of mount namespaces.

1) Presentation:

The main aim is to support portable root filesystems and allow containers,
virtual machines and other cases to use the same root filesystem.
For security reasons, filesystems can't be mounted inside user
namespaces, and mounting them outside does not solve the problem since
they will show up with the wrong UIDs/GIDs. Read and write operations
will also fail and so on.

The current userspace solution is to automatically chown the whole root
filesystem before starting a container, for example:
(host) init_user_ns 1000000:1065536 => (container) user_ns_X1 0:65535
(host) init_user_ns 2000000:2065536 => (container) user_ns_Y1 0:65535
(host) init_user_ns 3000000:3065536 => (container) user_ns_Z1 0:65535

Every time a chown is called, files are changed and so on... This
prevents having portable filesystems that you can drop anywhere
and boot. Having an extra step to adapt the filesystem to the current
mapping and persist it does not allow verifying its integrity, makes
snapshots and migration a bit harder, and probably has other limitations...
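
To make the cost of that extra step concrete, here is a minimal Python
sketch of the per-container shift that userspace has to apply to every
inode before boot. The names shift_for_container and chown_tree and the
default range of 65536 IDs are assumptions made for illustration; they do
not come from any existing tool.

```python
import os

def shift_for_container(disk_id, base, count=65536):
    """Return the host ID a file must be chowned to for one container.

    disk_id: owner as shipped in the portable image (0 .. count-1)
    base:    first host ID delegated to the container, e.g. 1000000
    """
    if not 0 <= disk_id < count:
        raise ValueError("ID %d is outside the container range" % disk_id)
    return base + disk_id

def chown_tree(root, base, count=65536):
    """Rewrite the ownership of every inode under root.

    This is the step that modifies the image: it needs privileges and
    touches each file once per container mapping.
    """
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, n) for n in dirnames + filenames)
    for path in paths:
        st = os.lstat(path)  # do not follow symlinks
        os.lchown(path,
                  shift_for_container(st.st_uid, base, count),
                  shift_for_container(st.st_gid, base, count))
```

With the mappings listed above, the same image owner 0 has to become
1000000, 2000000 or 3000000 on disk depending on which container boots it.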

It seems that there are multiple ways to make user namespaces combine
nicely with filesystems, but none of them is that easy. Bind mounting
and pinning the user namespace at mount time will not work: bind mounts
share the same super block, hence you may end up working on the wrong
vfsmount context and there is no easy way to get out of that...

Using the user namespace in the super block seems the way to go, and
there are the "Support fuse mounts in user namespaces" [1] patches, which
seem nice but perhaps too complex!? There is also the overlayfs solution,
and finally the VFS layer solution.

We present here a simple VFS solution: everything is packed inside the VFS,
and filesystems don't need to know anything (except probably XFS, and special
operations inside union filesystems). Currently it supports ext4, btrfs
and overlayfs. Changes to filesystems are small: just parse the
vfs_shift_uids and vfs_shift_gids options during mount and set the
appropriate flags in the super_block structure.

1) Filesystems don't need the FS_USERNS_MOUNT flag, so no user
namespace mounting, they stay secure, nothing changes.

2) The solution is based on VFS and mount namespaces, we use the user
namespace of the containing mount namespace to check if we should shift
UIDs/GIDs from/to virtual <=> on-disk view.
If a filesystem was mounted with the "vfs_shift_uids" and "vfs_shift_gids"
options, and if it shows up inside a mount namespace that supports VFS
UID/GID shifts, then during each access we remap the UID/GID either
to the virtual or to the on-disk view using simple helper functions to allow
the access. If the mount or the current mount namespace does not support VFS
UID/GID shifts, we fall back to the old behaviour: no shift is performed.

3) The existing user namespace interface is the one used to do the
translation from virtual to on-disk mapping.

4) inodes will always keep their original values, which reflect the
mapping inside init_user_ns, which we consider the on-disk mapping.

4.1) During access we map to the virtual view, and if
inode->{i_uid|i_gid} do not have a mapping in the mount namespace
we construct one for them.

4.2) For on-disk writes we construct the appropriate kuid/kgid that
should be stored on disk. If they have a mapping in the mount
namespace, we use the corresponding uid_t/gid_t values of that
mapping inside the mount namespace and construct the kuid from
the pair init_user_ns and uid_t. This covers cases where the
mapping inside should be the one stored on disk. If they
don't have a mapping in the mount namespace, we fall back to the
old behaviour: the global kuid inside init_user_ns is the one
used to update inode->i_uid.
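
The access and write rules above can be modelled in a few lines of
userspace Python. This is only a sketch under stated assumptions, not the
kernel helpers from this series: Extent, make_kid, from_kid,
shift_to_virtual and shift_to_disk are hypothetical names, IDs are plain
integers, and init_user_ns is treated as the identity mapping. What
happens on access when an on-disk ID falls outside the namespace range is
not spelled out in the text; the sketch simply leaves it unchanged.

```python
from collections import namedtuple

# One uid_map/gid_map line: IDs [inside, inside+count) in the namespace
# correspond to [outside, outside+count) in init_user_ns.
Extent = namedtuple("Extent", "inside outside count")

def make_kid(extents, inside_id):
    """Namespace-local ID -> global ID, or None if unmapped."""
    for e in extents:
        if e.inside <= inside_id < e.inside + e.count:
            return e.outside + (inside_id - e.inside)
    return None

def from_kid(extents, global_id):
    """Global ID -> namespace-local ID, or None if unmapped."""
    for e in extents:
        if e.outside <= global_id < e.outside + e.count:
            return e.inside + (global_id - e.outside)
    return None

def shift_to_virtual(extents, disk_id):
    """Access path: turn an on-disk ID into the one used for checks."""
    if from_kid(extents, disk_id) is not None:
        return disk_id                    # already mapped in the namespace
    shifted = make_kid(extents, disk_id)  # construct a mapping for it
    # Assumption: IDs outside the namespace range are left unchanged.
    return disk_id if shifted is None else shifted

def shift_to_disk(extents, global_id):
    """Write path: turn the in-kernel global ID into the on-disk ID."""
    inside = from_kid(extents, global_id)
    # Unmapped IDs fall back to the old behaviour and are stored as-is.
    return global_id if inside is None else inside

ns = [Extent(inside=0, outside=1000000, count=65536)]
print(shift_to_virtual(ns, 0))     # on-disk root seen as global 1000000
print(shift_to_disk(ns, 1000000))  # container root stored as 0
```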

As an example, if the mapping inside the mount namespace is 0:65535 and
outside it is 1000000:1065536, then 0:65535 will be the range that we use to
construct the UID/GID mapping into init_user_ns and use for on-disk
data. These are the persistent values that we want to write to the
disk. Therefore, we don't keep track of any UID/GID shift that was applied
before; this gives portability and allows reusing the previous mapping,
once freed, for another root filesystem...

If the mapping inside the mount namespace is 1000:65535 and outside
is 2000:65535, then the range used to construct the UID/GID mapping to
update inode->{i_uid|i_gid} will be the one inside the container: we
always use that one to construct the kuid/kgid from uid_t/gid_t and
init_user_ns.

$ cat /proc/self/uid_map
1000 2000 65536
$ stat -c '%u:%g' mountpoint/etc/fedora-release
$ stat -c '%u:%g' mountpoint/home/tixxdz/
$ touch mountpoint/newuser
touch: cannot touch ‘mountpoint/newuser’: Permission denied
$ stat -c '%u:%g' mountpoint/home/tixxdz/newuser
[ outside of namespaces] $ stat -c '%u:%g' mountpoint/home/tixxdz/newuser

Please note that the range here is not hardcoded to 65535, it can be any
value set by the creator of the user namespace. These patches use the
only interface user namespaces provide. 2**16 was used here to just show
how filesystems can be made portable by making the most used UIDs/GIDs
available inside containers.
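
Since the range comes entirely from the user namespace interface, a test
program can read it back rather than assume 65536. The following Python
sketch parses the three-column format of /proc/<pid>/uid_map (ID inside
the namespace, ID outside, length); parse_id_map and read_id_map are
hypothetical helper names used only for this illustration.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    extents = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 3:
            raise ValueError("unexpected map line: %r" % line)
        inside, outside, count = (int(f) for f in fields)
        extents.append((inside, outside, count))
    return extents

def read_id_map(path="/proc/self/uid_map"):
    """Read the mapping of the calling process (Linux only)."""
    with open(path) as f:
        return parse_id_map(f.read())

# The mapping shown in the transcript above:
print(parse_id_map("1000 2000 65536\n"))
```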

Simple demo with overlayfs and btrfs mounted with vfs_shift_uids and
vfs_shift_gids. The overlayfs mounts share the same upperdir. We
create two user namespaces, each with its own mapping, where
container-uid-2000000 will pull changes from the container-uid-1000000
upperdir automatically.

[tixxdz@fedora-kvm btrfs_root]$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)
[tixxdz@fedora-kvm btrfs_root]$ sudo mount -t overlay overlay \
-o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=/mnt/btrfs_root/container-uid-1000000/upperdir,workdir=/mnt/btrfs_root/container-uid-1000000/workdir \
[tixxdz@fedora-kvm btrfs_root]$ sudo mount -t overlay overlay \
-o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=/mnt/btrfs_root/container-uid-1000000/upperdir,workdir=/mnt/btrfs_root/container-uid-2000000/workdir \
[tixxdz@fedora-kvm btrfs_root]$ sudo chown -R 1000000.1000000 /mnt/btrfs_root/container-uid-1000000/workdir/work/
[tixxdz@fedora-kvm btrfs_root]$ sudo chown -R 2000000.2000000 /mnt/btrfs_root/container-uid-2000000/workdir/work/

[ Term 1 ]
[tixxdz@fedora-kvm container-uid-1000000]$ sudo ~/bin/mountns-uidshift -u 1000000
bash: /root/.bashrc: Permission denied
bash-4.3# cat /proc/self/uid_map
0 1000000 65536
bash-4.3# touch container-uid-1000000/merged/rootfile
bash-4.3# stat -c '%u:%g' container-uid-1000000/merged/rootfile

[ Term 2 ]
[tixxdz@fedora-kvm btrfs_root]$ sudo ~/bin/mountns-uidshift -u 2000000
[sudo] password for tixxdz:
bash: /root/.bashrc: Permission denied
bash-4.3# cat /proc/self/uid_map
0 2000000 65536
bash-4.3# stat -c '%u:%g' container-uid-2000000/merged/rootfile

[ Term 3 ] (outside of all namespaces)
[tixxdz@fedora-kvm btrfs_root]$ stat -c '%u:%g' container-uid-1000000/upperdir/rootfile

This means that root in a user namespace or inside containers is able to
write inodes with uid/gid == 0 to disk. This may sound strange and
dangerous, and yes, of course, care must be taken. This is why we have added
the following:

1) Filesystems when mounted must explicitly support "vfs_shift_uids"
and "vfs_shift_gids", we don't require mounting inside user namespaces.

2) Containers or mounts can have their parent directory as 0700, and
even before mounting clean the mount namespace, set the appropriate
propagation flags and so on...

3) To be able to set the CLONE_MNTNS_SHIFT_UIDGID flag on the new mount
namespace, either the caller has to be real root in init_user_ns, or the
parent of the new mount namespace already has that flag set. This allows
nesting, which I discussed briefly with Serge Hallyn, and he suggested
that this should be supported. Preventing nesting is doomed to fail. This
way we have security and nesting at the same time. Of course, if you clear
that flag you will be able to set it again only if you are capable
in init_user_ns.

4) If the mount namespace has the flag CLONE_MNTNS_SHIFT_UIDGID set but
the filesystem was mounted without "vfs_shift_uids" and "vfs_shift_gids"
or does not support these options, then no shifting is performed. You
have to meet the two conditions at each access, otherwise we fall back to
the current behaviour.

5) Only the creator of the mount namespace or one with similar
privileges is able to change the mapping rules of the user namespace of
that mount namespace. This ensures that only a more privileged user is able
to change the mapping and at the same time it gives some flexibility
since the rules can be changed, and we never persist the virtual
UIDs/GIDs into disk, only the view in init_user_ns is always stored into
disk.
To complete this solution the current blocker is: since we need a way to
control mount namespaces we need a new flag, however all flags of the current
clone() syscall are consumed, yes 32 bits, no luck! In this RFC I didn't
include a new clone4() syscall [2], which was already requested in the past
and for which patches already exist. This way this RFC
stays minimal.

The flag we use here is just for demonstration, please see patch 0001
and the program mountns-uidshift.c [3] for that. Future versions
will include the new clone4() syscall.

2) TEST:

Apply on top of Linux 4.6-rc6 HEAD 04974df8049fc4240d2275, and use this
program mountns-uidshift.c to test the shifted mount namespaces.

With current mapping rules init_user_ns:
[1000000:1065536] => new_user_ns: [0:65536]
# cat /proc/self/uid_map
0 1000000 65536
# cat /proc/self/gid_map
0 1000000 65536

2.1) ext4:

/ on ext4 without vfs_shift_uids, vfs_shift_gids
/mnt/ext4_root on ext4 with vfs_shift_uids, vfs_shift_gids
/mnt/ext4_root/rootfs/fedora-tree (Another fedora rootfs)
/mnt/ext4_root/container-uid-1000000 (container files with uid 1000000)
/mnt/ext4_root/container-uid-1000000/mountpoint (bind mount of fedora-tree)

$ sudo mount -t ext4 -ovfs_shift_uids,vfs_shift_gids \
/dev/fedora/ext4_root /mnt/ext4_root/
$ mount | grep ext4 -
/dev/mapper/fedora-root on / type ext4 (rw,relatime,seclabel,data=ordered)
/dev/sda1 on /boot type ext4 (rw,relatime,seclabel,data=ordered)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
$ sudo mkdir /mnt/ext4_root/rootfs/
$ sudo yum -y --releasever=23 --installroot=/mnt/ext4_root/rootfs/fedora-tree \
--disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim
$ sudo mkdir /mnt/ext4_root/container-uid-1000000/
$ sudo mkdir /mnt/ext4_root/container-uid-1000000/mountpoint
$ sudo chown -R 1000000.1000000 /mnt/ext4_root/container-uid-1000000/
$ sudo mount --bind -ovfs_shift_uids,vfs_shift_gids \
/mnt/ext4_root/rootfs/fedora-tree/ mountpoint/
$ mount | grep vfs_shift_uids -
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root/container-uid-1000000/mountpoint type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
$ sudo ~/bin/mountns-uidshift -u 1000000
bash-4.3# id
uid=0(root) gid=0(root) groups=0(root) context=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
bash-4.3# cat /proc/self/uid_map
0 1000000 65536
bash-4.3# mount | grep shift -
/dev/mapper/fedora-ext4_root on /mnt/ext4_root type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
/dev/mapper/fedora-ext4_root on /mnt/ext4_root/container-uid-1000000/mountpoint type ext4 (rw,relatime,seclabel,data=ordered,vfs_shift_uids,vfs_shift_gids)
bash-4.3# stat -c '%u:%g' /etc/motd
bash-4.3# stat -c '%u:%g' /mnt/ext4_root/rootfs/fedora-tree/etc/motd
bash-4.3# stat -c '%u:%g' mountpoint/etc/motd
bash-4.3# stat -c '%u:%g' /etc/machine-id
bash-4.3# echo "blabla" > /etc/machine-id
bash: /etc/machine-id: Permission denied
bash-4.3# stat -c '%u:%g' mountpoint/etc/machine-id
bash-4.3# sha1sum mountpoint/etc/machine-id
edb24591988f0f003cd397704f49e92208b3015f mountpoint/etc/machine-id
bash-4.3# m=$(cat /dev/urandom | tr -cd 'a-f0-9' | head -c 32); echo $m | sha1sum; echo $m > mountpoint/etc/machine-id
f256a796b1f2ed09c4107f1f5aff2568fb2d79cc -
bash-4.3# sha1sum mountpoint/etc/machine-id
f256a796b1f2ed09c4107f1f5aff2568fb2d79cc mountpoint/etc/machine-id
bash-4.3# stat -c '%u:%g' mountpoint/etc/machine-id
[outside of namespaces]$ stat -c '%u:%g' /mnt/ext4_root/container-uid-1000000/mountpoint/etc/machine-id

Test with unprivileged user inside the new mount and user namespaces:

Test with uid tixxdz == 1000, the user exists on both:
(1) /
(2) /mnt/ext4_root/rootfs/fedora-tree which is bind mounted into /mnt/ext4_root/container-uid-1000000/mountpoint

-bash-4.3$ touch /home/tixxdz/newfile
touch: cannot touch /home/tixxdz/newfile: Permission denied
-bash-4.3$ stat -c '%u:%g' /home/tixxdz/
-bash-4.3$ stat -c '%u:%g' mountpoint/home/tixxdz/
-bash-4.3$ touch mountpoint/home/tixxdz/newfile
-bash-4.3$ stat -c '%u:%g' mountpoint/home/tixxdz/newfile
[outside of namespaces]$ stat -c '%u:%g' /mnt/ext4_root/container-uid-1000000/mountpoint/home/tixxdz/newfile

2.2) btrfs:
Same steps as ext4.

2.3) overlayfs:

2.3.1) Native support using VFS:

Overlayfs is natively supported if lowerdir, upperdir and workdir are all
on a mount that supports vfs_shift_uids and vfs_shift_gids flags and we
are in a mount namespace that also supports that.

$ mount | grep btrfs
/dev/mapper/fedora-btrfs_root on /mnt/btrfs_root type btrfs (rw,relatime,seclabel,space_cache,vfs_shift_uids,vfs_shift_gids,subvolid=5,subvol=/)
$ cd /mnt/btrfs_root/
$ sudo mkdir -p container-uid-2000000/{upperdir,workdir,merged}
$ sudo chown -R 2000000.2000000 container-uid-2000000/
$ cd container-uid-2000000/
$ sudo mount -t overlay overlay -o,lowerdir=/mnt/btrfs_root/rootfs/fedora-tree,upperdir=upperdir,workdir=workdir merged
$ sudo chown -R 2000000.2000000 workdir/work/
$ sudo ~/bin/mountns-uidshift -u 2000000
bash-4.3# stat -c '%u:%g' merged/etc/passwd
bash-4.3# touch merged/overlayfs-file
bash-4.3# stat -c '%u:%g' merged/overlayfs-file
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-2000000/merged/overlayfs-file
[outside of namespaces]# stat -c '%u:%g' /mnt/btrfs_root/container-uid-2000000/upperdir/overlayfs-file

2.3.2) Complex support or union filesystems:

If the overlayfs lowerdir and upperdir are not on a filesystem that natively
supports vfs_shift_uids and vfs_shift_gids, then to support VFS UID/GID
shifts we must adapt the helper functions that were introduced in this
series to also take a super_block struct and test if the appropriate flags
were set on overlayfs instead of the other filesystem which the inode
belongs to. The on-disk <=> virtual translation should then happen inside
overlayfs.

I think this will always be the case of union mounts which fetch an inode
from another mount. I think that solution (2.3.2) can also be implemented,
I had some ugly patches to implement this on top of overlayfs, but not
sure, better see what others think about VFS UID/GID shifts first.

IMO solution (2.3.1) if done correctly is the way to go, in the end all
this relates to the virtual view of UID/GID inside the kernel, and how
resources are translated to them, it's not related to overlayfs.

* Confirm current design, and make sure that the mapping is done
correctly.

* Add clone4() syscall [2]

* Investigate if current setns() checks to enter new mount namespaces
are sufficient ?

* Add POSIX ACL support ?

* Check if all filesystem operations are correctly supported and recheck
permissions access.

* Do filesystems provide some operations to control disk or host resources ?
in other words are there some inodes on filesystems that allow to access
host resources, if so then maybe these inodes either should be marked only
safe in init_user_ns or get the appropriate capable() in init_user_ns if
missing. Needs investigation.

* Add XFS support.

[1] https://www.redhat.com/archives/dm-devel/2016-April/msg00368.html
[2] https://lkml.org/lkml/2015/3/15/10
[3] https://raw.githubusercontent.com/OpenDZ/research/master/kernel/mountns-uidshift.c


[RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
[RFC v2 PATCH 1/8] VFS: add CLONE_MNTNS_SHIFT_UIDGID flag to allow mounts to shift their UIDs/GIDs
[RFC v2 PATCH 2/8] VFS:uidshift: add flags and helpers to shift UIDs and GIDs to virtual view
[RFC v2 PATCH 3/8] fs: Treat foreign mounts as nosuid
[RFC v2 PATCH 4/8] VFS:userns: shift UID/GID to virtual view during permission access
[RFC v2 PATCH 5/8] VFS:userns: add helpers to shift UIDs and GIDs into on-disk view
[RFC v2 PATCH 6/8] VFS:userns: shift UID/GID to on-disk view before any write to disk
[RFC v2 PATCH 7/8] ext4: add support for vfs_shift_uids and vfs_shift_gids mount options
[RFC v2 PATCH 8/8] btrfs: add support for vfs_shift_uids and vfs_shift_gids mount options

Diffstat for this RFC
fs/attr.c | 44 +++++++++++++++++++++++--------
fs/btrfs/super.c | 15 ++++++++++-
fs/exec.c | 2 +-
fs/ext4/super.c | 14 ++++++++++
fs/inode.c | 9 ++++---
fs/mount.h | 1 +
fs/namei.c | 6 +++--
fs/namespace.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/stat.c | 4 +--
include/linux/fs.h | 14 ++++++++++
include/linux/mount.h | 1 +
include/linux/user_namespace.h | 8 ++++++
include/uapi/linux/sched.h | 1 +
kernel/capability.c | 14 ++++++++--
kernel/fork.c | 4 +++
kernel/user_namespace.c | 13 ++++++++++
security/commoncap.c | 2 +-
security/selinux/hooks.c | 2 +-
18 files changed, 319 insertions(+), 25 deletions(-)