Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems
From: Djalal Harouni
Date: Tue May 10 2016 - 06:33:54 EST
On Mon, May 09, 2016 at 04:26:30PM +0000, Serge Hallyn wrote:
> Quoting Djalal Harouni (tixxdz@xxxxxxxxx):
> > Hi,
[...]
> >
> > After clone(CLONE_NEWUSER|CLONE_NEWNS|CLONE_MNTNS_SHIFT_UIDGID), setup
> > the user namespace mapping, I guess you drop capabilities, do setuid()
> > or whatever and start the PID 1 or the app of the container.
> >
> > Now and to not confuse more Dave, since he doesn't like the idea of
> > a shared backing device, and me neither for obvious reasons! the shared
> > device should not be used for a rootfs, maybe for read-only user shared
> > data, or shared config, that's it... but for real rootfs they should have
> > their own *different* backing device! unless you know what you are doing
> > hehe I don't want to confuse people, and I just lack time, will also
> > respond to Dave email.
>
> Yes. We're saying slightly different things. You're saying that the admin
> should assign different backing stores for containers. I'm saying perhaps
> the kernel should enforce that, because $leaks. Let's say the host admin
> did a perfect setup of a container with shifted uids. Now he wants to
> run a quick ps in the container... he does it in a way that leaks a
> /proc/pid reference into the container so that (evil) container root can
> use /proc/pid/root/ to get a toehold into the host /. Does he now have
> shifted access to that?
No, assuming host / and its other mount points are not mounted with the
vfs_shift_uids and vfs_shift_gids options. In that case no shift is
performed at all.
1) If you mount host / with vfs_shift_uids and vfs_shift_gids, it's as
if real root in init_user_ns did "chmod -R o+rwx /"... It does not make
sense, and since no one can edit/remount mounts to change their options
in the mount namespace of init_user_ns, it's safe, and not available by
default.
2) That's also why filesystems must support this explicitly; it is not
done on their behalf.
IMO the kernel is already enforcing this: even if you assign different
backing stores to containers, you can't have shifted access there unless
you explicitly tell the kernel that the mount is meant to be shifted by
adding the vfs_shift_uids and vfs_shift_gids mount options.
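A toy model of the rule above, in Python. This is not code from the
patches; it only illustrates "no mount option, no shift". The names
Mount and shifted_uid, the uid_map tuple layout (borrowed from one line
of /proc/<pid>/uid_map) and the choice to return None for an unmapped id
are all assumptions made for the sketch, and the direction of the
translation is my reading of the design.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Mount:
    """Hypothetical stand-in for a mount; only the option discussed here."""
    shift_uids: bool  # was the filesystem mounted with vfs_shift_uids?


def shifted_uid(disk_uid: int, mount: Mount, mntns_shift: bool,
                uid_map: Tuple[int, int, int]) -> Optional[int]:
    """Return the uid this toy model would use for an inode.

    mntns_shift: the mount namespace was created with the proposed
                 CLONE_MNTNS_SHIFT_UIDGID flag.
    uid_map:     (inside_start, outside_start, count).
    """
    # Both opt-ins are required; otherwise the on-disk id is used as-is.
    if not (mount.shift_uids and mntns_shift):
        return disk_uid
    inside, outside, count = uid_map
    if inside <= disk_uid < inside + count:
        return outside + (disk_uid - inside)
    return None  # not covered by the mapping (model choice, see above)


uid_map = (0, 1000000, 65536)
host_root = Mount(shift_uids=False)
image = Mount(shift_uids=True)
print(shifted_uid(0, host_root, True, uid_map))  # unshifted host /
print(shifted_uid(0, image, True, uid_map))      # shifted container image
```

In this model a leaked reference to host / gains nothing, because host /
never carries the option.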
> I think if we say "this blockdev will have shifted uids in /proc/$pid/ns/user",
> then immediately that blockdev becomes not-readable (or not-executable)
> in any namespace which does not have /proc/$pid/ns/user as an ancestor.
Hmm,
(1) This won't work: to do that you have to know /proc/$pid/ns/user in
advance, and since filesystems can't be mounted inside a user namespace
this brings us back to the same blocker! In our use case we want to
shift UIDs/GIDs just to access inodes; there is no need to expose the
whole filesystem. Root is responsible and filesystems stay safe.
(2) Why complicate things? The kernel already supports this, and it's a
generic solution.
As said, you can just create new mount namespaces and mount things there
as private, slave... Mount your blockdev there, and it will be shifted
for the processes that inherit that mount. You can even have
intermediate mount namespaces that are used only to perform setup, that
no other process/code can enter, and that you can forget/unref at any
moment. You don't have any leaks, nothing! You control that piece of
code.
If you want that blockdev to become not-readable or noexec in any
namespace which does not have /proc/$pid/ns/user as an ancestor,
the kernel offers a better interface: it allows that blockdev to not
even show up in any ancestor, by making use of mount namespaces and
MS_PRIVATE, MS_SLAVE... No one will even notice that the mount exists.
However, if you want to access that blockdev for whatever reason, then
create a new mount namespace, use MS_PRIVATE, MS_SLAVE and all the
noexec flags, and mount it.
Yes slightly different things, but I don't want to add complexity where
the interface already exists in the kernel...
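The mount-namespace setup described above can be sketched with existing
kernel interfaces only. This is a minimal sketch, assuming Linux, glibc
and CAP_SYS_ADMIN; the function name private_noexec_mount and its
arguments are made up for illustration. The data argument is where a
filesystem option string would go; vfs_shift_uids/vfs_shift_gids exist
only with this RFC applied, so the sketch does not pass them by default.

```python
import ctypes
import os

# Flag values from <linux/sched.h> and <sys/mount.h>.
CLONE_NEWNS = 0x00020000
MS_NOSUID = 2
MS_NODEV = 4
MS_NOEXEC = 8
MS_REC = 16384
MS_PRIVATE = 1 << 18

# Conservative flags for a mount that is only there to be looked at.
RESTRICTED = MS_NOEXEC | MS_NOSUID | MS_NODEV


def _check(ret: int) -> None:
    """Raise OSError from errno if a libc call failed."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def private_noexec_mount(source: bytes, target: bytes, fstype: bytes,
                         data: bytes = None) -> None:
    """Mount source on target so that only this process tree sees it.

    Hypothetical helper; needs CAP_SYS_ADMIN. It is defined here but
    deliberately not called.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong,
                           ctypes.c_char_p]
    # 1. Leave the parent's mount namespace.
    _check(libc.unshare(CLONE_NEWNS))
    # 2. Stop propagation in both directions for everything under /.
    _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None))
    # 3. Mount the blockdev; it never shows up in any ancestor namespace.
    _check(libc.mount(source, target, fstype, RESTRICTED, data))
```

A setup process would call the helper, do its work, and exit; once the
last process in that mount namespace is gone, the mount goes with it.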
> With obvious check as in write-versus-execute exclusion that you cannot
> mark the blockdev shifted if ancestor user_ns already has a file open for
> execute.
Please note here that it's the same ancestor who will mark the blockdev
to be shifted. Why would the ancestor at the same time keep a file open
in a filesystem that is meant to be shifted, and later execute through
that fd a program that was just crafted by an untrusted container?!
For me the kernel already offers the interfaces; there is no need to
complicate things or enforce it... As said in other responses, the
design of these patches is to just use what the kernel already provides.
> BTW, perhaps I should do this in a separate email, but here is how I would
> expect to use this:
>
> 1. Using zfs: I create a bare (unshifted) rootfs fs1. When I want to
> create a new container, I zfs clone fs1 to fs2, and let the container
> use fs2 shifted. No danger to fs1 since fs2 is cow. Same with btrfs.
Yes, that would work since fs1 is unshifted. The only requirement is
that fs2 should not reside on the same backing store as fs1, so that it
does not share quota with fs1 (I'm not a ZFS user...). You can always
make the parent of the fs2 mount point or the containers' directories
0700... and root should not go there and exec programs, just as it's not
safe to go into /home/$user... and exec...
> 2. Using overlay: I create a bare (unshifted) rootfs fs1. When I want
> to create a new container, I mount fs1 read-only and shifted as base
> layer, then fs2 as the rw layer.
Yes, here you may share quota if the fs2 rw layers of all containers
reside on the same backing store... but here the requirement is that fs1
should be mounted the first time with shifted uids/gids, where fs1
resides on ext4, btrfs, xfs or any other filesystem that supports
shifting. This means you may have to mount fs1 on a different backing
store, say on /root-fs0/lib/container-image-fs1/ with
vfs_shift_uids/gids, then use it as a shared read-only lower layer.
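The layering described above can be sketched as the option string that
would be handed to an overlay mount. Only the lowerdir path comes from
this mail; the helper name overlay_options and the /containers/c1/...
paths are hypothetical. overlayfs requires upperdir and workdir to be on
the same filesystem, and for simplicity the sketch rejects paths that
would need escaping.

```python
def overlay_options(lowerdir: str, upperdir: str, workdir: str) -> str:
    """Build the option string for an overlay mount.

    lowerdir is the shared read-only image (already mounted shifted),
    upperdir/workdir are the per-container rw layer.
    """
    for path in (lowerdir, upperdir, workdir):
        # ',' separates options and ':' separates lower layers.
        if "," in path or ":" in path:
            raise ValueError("path needs escaping: %r" % path)
    return "lowerdir=%s,upperdir=%s,workdir=%s" % (lowerdir, upperdir, workdir)


opts = overlay_options(
    "/root-fs0/lib/container-image-fs1",  # from the mail: shifted, read-only
    "/containers/c1/upper",               # hypothetical per-container rw layer
    "/containers/c1/work",                # hypothetical; same fs as upper
)
print(opts)
```

The resulting string is what would follow -o in
"mount -t overlay overlay -o <opts> <target>".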
Of course you may just use your host / as a read-only layer, mounting it
the first time with vfs_shift_uids/gids, but as discussed above that's
not really safe unless it's not a shared-user system, or you know what
you are doing...
These patches do not modify overlayfs. overlayfs support is transparent
if the underlying filesystems, the ones holding the upper and lower
directories, support vfs_shift_uids/vfs_shift_gids.
If we go with overlayfs, we make it an overlayfs problem, where it needs
a different approach related to union mounts, which I noted in the cover
letter of these patches.
> The point here is that the zfs clone plus container start takes (for a
> 600-800M rootfs) about .5 seconds on my laptop, while the act of shifting
> all the uids takes another 2 seconds. So being able do this without
> manually shifting would be a huge improvement for cases (i.e. docker)
> where you do lots and lots of quick deploys.
>
That's one of the use cases of course! You can also verify the
integrity... and you are able to really make the container's fs
read-only without the recursive chown...
Thank you for your feedback!
--
Djalal Harouni
http://opendz.org