Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Tue May 17 2016 - 11:42:50 EST


On Sat, May 14, 2016 at 06:46:54AM -0700, James Bottomley wrote:
> On Sat, 2016-05-14 at 10:53 +0100, Djalal Harouni wrote:
> > On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> > > On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > > > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:

[...]
> > In this series we don't hijack setfsuid() in an indirect way,
> > setfsuid maps UIDs into current userns according to rules set by
> > parent. Changing current_fsuid() to some other mapping is a way to
> > allow processes to bypass that and use it to access other inodes...
> > This should not change and fsuid should continue to follow these
> > rules...
>
> Both solutions do this

James, I don't update current_fsuid() nor any other creds field in this
RFC. For the reason that if I've a pinned mapping of 0:100000:65536 that
containers or apps want to use for their own purpose, an app X started
by privileged process and sets global uid to 100000 and its current user
namespace 0:100000:65536, and that app X forks another app Y with global
uid 100000 sandbox it, hide other processes, sets its user namespace
mapping to 1000:100000:1 for app Y, same thing for app Z 2000:100000:1
restrict the set of syscalls for both Y and Z... even with all this they
will be able keep their access to inode->i_uid == 0 where we don't want
that since we don't give a mapping to 0... we just want them to access
inode->i_uid == 1000 for app Y and 2000 for app Z... they cross user
namespaces... they use another mapping... and if Y forks to another app
and even if it sets a new userns mapping with a new restricted range, it
will continue to use the old range 65536 and inodes will show up with
real uids instead of nobody..


> > A cred->fsuid solution is safe or used to be safe only inside
> > init_user_ns where there is always a mapping or in context of current
> > user namespace. In an other user namespace with 0:1000:1 mapping,
> > you can't set it to arbitrary mapping like 0:4000:1... It will give
> > confined processes access to inodes that satisfy the kuid_t 4000
> > mapping and which the app/container wants to deny, they only want
> > 0:1000:1. ..
>
> OK, so both solutions are safe here too. Your safety comes from only
> remapping in the userns; mine comes from the normal filesystem acl
> rules: either the userns for different users all have disjoint ids
> regulated by /etc/subuidmap or they're all using the same one (like
> docker 1.10) in either case, you could regulate by having the mount
> under a directory which is accessible only to the userns owner.

Please see above comment. Nested unprivileged apps may want to restrict
syscall operations and access to inodes, maybe we don't want the forked
sandboxed app to have access to inodes, and it will be hard if not
impossible if you update global creds each time...


> > We don't cross user namespaces, we don't use different mappings for
> > cred->uid, cred->fsuid... A clean solution is to shift inodes
> > UID/GID and not change fsuid to cross namespaces. Not to mention how
> > it may interact with capabilities...
>
> This is a subjective question on what constitutes "clean". I think we
> both think the other solution isn't clean, so that's for others to
> adjudicate.

If you see it that way :-) , I just want to access from user namespace
in the safest way as possible, if there is a better solution or if my
patches are buggy, I'll drop them... no problem!


> > We follow user namespace rules and we keep "the parent defines a
> > range that the children can't escape" semantics. There is a clear
> > relation between user namespaces that should not be broken.
>
> OK, so I separated the problem into a userns one, which remaps for the
> processes in user space, and a vfs one which remaps the on-disk id.
> However, they could be combined by allowing the userns to mount
> shiftfs but only on designated filesystems and setting the uidmappings
> to the same ones as the userns.
>
> > We explicitly don't define a new user namespace mapping nor add a new
> > interface for the simple reason it's: *too complicated*. We can do
> > that, but no thanks! May be in future if there is a real need or
> > things are clear... The current user namespace interface is getting
> > standard and stable, so we just keep it that way and make it
> > consistant inside VFS.
>
> I don't accept the too complicated point. For fully unprivileged
> containers, the host admin already has to set up the subuid/subgid map
> files which is most of the complexity. Once that's done, the same maps
> can be used to shift mount. Once it's all set up, no further
> intervention is required.

Well, please check my first comment. In this RFC you don't have to be
always the real root or a privileged parent to do so... it allows nesting
since it seems that the maintainers want nesting support.


> > We give VFS control of that, and we make mount namespaces the central
> > part of this whole logic.
>
> Right, that's what causes the logic to thread throughout the entire vfs
> and into the fs layer. The fundamental point of difference is that I'd
> like a solution which encapsulates the problem rather than exposing it
> to the vfs.
>
> > We make admins life easier where they can pull container images, root
> > filesystems from containers/apps hubs... verify the signature and
> > start them with different mappings according to host resources... We
> > don't want them to do anything. The design was planned to make it
> > easier for users, it should work out of the box, and it can be used
> > to handle complex stuff too, since it's flexible.
>
> Either works easily for users. Setting stuff up is always the job of
> the admin in both solutions.


Hmm, I don't agree here, things should be safe by default and work out
of the box without the intervention of the admin.


> > Able to support most filesystems including on-disk filesystems
> > natively.
>
> Shiftfs does this. More importantly it supports subtrees, so I can
> unpack an image root on to an existing filesystem and remap it into a
> container.

This RFC should support subtrees of course, the mapping is done in the
context of the mount namespace of the caller.


> > Able to support disk quota according to the shifted UID/GID on-disk
> > values. Especially during inode creation...
>
> Quota can be shifted, I just wasn't sure it was necessary. If the
> usual use case is for unpacked roots, chances are you want the
> remapping to use the group quota of the userns owner, which they'd get
> naturally so, while it's possible to remap projid, I didn't think it
> needed to be done.
>
> > Able to support ACL if requested.
>
> Both do this.
>
> > The user namespace mapping is kept a runtime configure option, we
> > don't pin a special mapping at any time, and of course parent creator
> > of user namespace is the one that can manipulate it, at the same time
> > the mapping is restricted according to grandpa rules and so on...
> >
> > It allows unprivileged to use the VFS UID/GID shift without the
> > intervention of a privileged process each time. The real privileged
> > process sets the filesystem and the mount namespace the first time,
> > then it should work for all nested namespaces and containers. It does
> > not need the intervation of init_user_ns root to set the mapping and
> > make it work, you don't have to go in and go out to setup the thing,
> > etc.
>
> Both solutions work like this. When I use this for shifted roots of
> emulation containers, it's set up once at start of day. I then build
> the containers unprivileged using newsubuid/newsubgid as I'm using
> them. Once the shifts are done at start of day, no other admin support
> is required.

This RFC does not require the intervention of the admin or real root
process to adapt the mapping, when the filesystem is mounted with
shifted options and the mount namespace is created, everything is
inherited and you can use real separation for new nested containers/apps
if you want... you don't need the intervention of a privileged entity to
adapt the mapping at the start or after...

Just take a stock rootfs or an image and use it.

Please note that the approach this RFC takes was never discussed... I'll
let everyone sleep on it and see later after the merge window. Thanks!


> James
>
>
> > We don't do this on behalf of filesystems, they should explicitly
> > support it. procfs and other host resource virtual filesystems are
> > safe and currently they don't need shifting.
> >
> > We try to fix the problem where it should be fixed, and not hide
> > it...
>
>

--
Djalal Harouni
http://opendz.org