Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Sat May 14 2016 - 05:53:29 EST


On Thu, May 12, 2016 at 03:24:12PM -0700, James Bottomley wrote:
> On Thu, 2016-05-12 at 20:55 +0100, Djalal Harouni wrote:
> > On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> > > On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
[...]

> > >
> > > The credentials are per thread, so it's a standard way of doing
> > > credential shifting and no other threads of execution in the same
> > > task get access. As long as you bound the override_creds()/
> > > revert_creds() pairs within the kernel, you're safe.
> >
> > No, and here sorry I mean shifted.
> >
> > current_fsuid() is global through all fs operations which means it
> > crosses user namespaces... it was safe the days of only init_user_ns,
> > not anymore... You give a mapping inside containers to fsuid where
> > they don't want to have it... this allows to operate on inodes inside
> > other containers... update current_fsuid() even if we want that user
> > to be nobody inside the container... and later it can access the
> > inodes of the shifted fs... and by same current of course...
>
> OK, I still don't understand what you're getting at. There are three
> per-thread uids: uid, euid and fsuid (real, effective and filesystem).
> They're all either settable via syscall or inherited on fork. They're
> all kernel side, meaning they're kuid_t. Their values stay invariant
> as you move through namespaces. They change (and get mapped according
> to the current user namespace setting) when you call set[fe]uid(). So
> when I enter a user namespace with mapping
>
> 0 100000 1000
>
> and call setuid(0) (which sets all three), they all pick up the kuid_t
> of 100000. This means that writing a file inside the user namespace
> after calling setuid(0) appears as real uid 100000 on the medium even
> though if I call getuid() from the namespace, I get back 0. What
> shiftfs does is hijack temporarily the kernel fsuid/fsgid for
> permission checks, so you can remap to any old uid on the medium
> (although usually you'd pass in uidmap=0:100000:1000); it maps back
> from kuid_t 100000 to kuid_t 0, which is why the container can now read
> and write the underlying medium at on-media id 0 even though root
> inside the container has kuid_t 100000. There's no permanent change of
> fsuid and it stays at its invariant value for the thread except as a
> temporary measure to do the permission checks on the underlying of the
> shifted filesystem.
>
> > > > The worst thing is that current_fsuid() now does not follow the
> > > > /proc/self/uid_map interface! This is a serious vulnerability and
> > > > a mix of the current semantics... it's updated but using other
> > > > rules...?
> > >
> > > current_fsuid() is already mapped via the userns; it's already a
> > > kuid_t at its final value. Shifting that is what you want to remap
> > > underlying volume uid/gid's. The uidmap/gidmap inputs to this are
> > > shifts on the final underlying uid/gids.
> >
> > => some points:
> > Changing setfsuid(), its interfaces and rules... or an indirect way
> > to break another syscall...
>
> There is no change to setfsuid().
>
> > The userns used for *mapping* is totally different and not
> > standard... losing "init_user_ns and its descendant userns
> > *semantics*...", yet a totally unlinked mapping...
>
> There is no user namespace mapping at all. This is a simple shift,
> kernel side, of uids and gids at their kuid_t values.
>
> > Breaking current_uid(),current_euid(),current_fsuid() which are
> > mapped but in *different* user namespaces... hence different values
> > inside namespaces... you can change your userns mapping but that
> > current_fsuid specific one will always be remapped to some other
> > value inside even if you don't want it... It crosses user
> > namespaces... uid and euid are remapped according to
> > /proc/self/uid_map, fsuid is remapped according to this new
> > interface...
> >
> > Hard coding the mapping, nested containers/apps may *share* fsuid and
> > can't get rid of it even if they change the inside userns mapping to
> > disable, split, reduce mapped users or offer better isolation they
> > can't... no way to make private inodes inside containers if they
> > share the final fsuid, inside container mapping is ignored...
> > ...
>
> OK, I think there's a misunderstanding about how credential overrides
> work. They're not permanent changes to the credentials, they're
> temporary ones to get stuff done within the kernel at a temporary
> privilege. You can make credentials permanent if you go through
> prepare_creds()/commit_creds(), but for making them temporary you do
> prepare_creds()/override_creds() and then revert_creds() once you're
> done using them.
>
> If you want to see a current use of this, try fs/open.c:faccessat.
> What it's doing is temporarily overriding fsuid with the real uid to
> check the permissions before reverting the credentials and returning to
> the user.

Thank you for explaining things, but I think you should take the time to
read this RFC and understand the problems it tries to solve. Here is a
quick summary of some of the problems it avoids:

In this series we don't hijack setfsuid() in an indirect way: setfsuid()
maps UIDs into the current user namespace according to the rules set by
the parent. Changing current_fsuid() to some other mapping is a way to
let processes bypass those rules and use it to access other inodes.
This should not change; fsuid should continue to follow these rules.

A cred->fsuid solution is safe, or used to be safe, only inside
init_user_ns, where there is always a mapping, or in the context of the
current user namespace. In another user namespace with a 0:1000:1
mapping, you can't set it to an arbitrary mapping like 0:4000:1: that
would give confined processes access to inodes matching kuid_t 4000,
access which the app/container wants to deny; they only want 0:1000:1.

We don't cross user namespaces, and we don't use different mappings for
cred->uid and cred->fsuid. A clean solution is to shift inode UIDs/GIDs,
not to change fsuid so that it crosses namespaces. Not to mention how a
shifted fsuid may interact with capabilities...

We follow user namespace rules and we keep the "the parent defines a
range that the children can't escape" semantics. There is a clear
relation between user namespaces that should not be broken.

We explicitly don't define a new user namespace mapping, nor add a new
interface, for the simple reason that it is *too complicated*. We could
do that, but no thanks! Maybe in the future, if there is a real need or
things are clearer...
The current user namespace interface is becoming standard and stable, so
we just keep it that way and make it consistent inside the VFS.

We give VFS control of that, and we make mount namespaces the central
part of this whole logic.

We make admins' lives easier: they can pull container images and root
filesystems from container/app hubs, verify the signatures, and start
them with different mappings according to host resources. We don't want
them to have to do anything more.
The design was planned to make things easy for users; it should work out
of the box, and it can handle complex setups too, since it's flexible.

Able to support most filesystems, including on-disk filesystems,
natively.

Able to support disk quotas according to the shifted on-disk UID/GID
values, especially during inode creation.

Able to support ACLs if requested.

The user namespace mapping is kept a runtime configuration option; we
don't pin a special mapping at any time, and of course the parent
creator of the user namespace is the one that can manipulate it. At the
same time, the mapping is restricted according to the grandparent's
rules, and so on...

It allows unprivileged processes to use the VFS UID/GID shift without
the intervention of a privileged process each time.
The real privileged process sets up the filesystem and the mount
namespace the first time; after that it works for all nested namespaces
and containers. It does not need the intervention of init_user_ns root
to set the mapping and make it work; you don't have to go in and out to
set the thing up, etc.

We don't do this on behalf of filesystems; they should explicitly
support it. procfs and other host-resource virtual filesystems are safe,
and currently they don't need shifting.

We try to fix the problem where it should be fixed, and not hide it...

--
Djalal Harouni
http://opendz.org