Re: [RFC v2 PATCH 0/8] VFS:userns: support portable root filesystems

From: Djalal Harouni
Date: Thu May 12 2016 - 15:56:51 EST


On Wed, May 11, 2016 at 11:33:38AM -0700, James Bottomley wrote:
> On Wed, 2016-05-11 at 17:42 +0100, Djalal Harouni wrote:
> > On Tue, May 10, 2016 at 04:36:56PM -0700, James Bottomley wrote:
[...]
> > Hmm anyway you are mounting this on behalf of filesystems, so if you
> > add the recursive thing, you will just probably make everything
> > worse, by making any /proc, /sys dentry that's under that path
> > shiftable, and unprivileged users can just create user namespaces and
> > read /proc/* and all the other stuff that doesn't have capable()
> > related to the init_user_ns host...
>
> That's up to the admin who does the shifting. Recursive would be an
> option if added.

Hmm, I'm not sure you got my point... you have just turned it into an admin
problem, where admins download an image, verify it, and use it for their
container with /proc...! that's a different problem!


> > what if you have paths like /filesystem0/uidshiftedY/dir,
> > /filesystem0/uidshiftedX/dir , /filesystem0/notshifted/dir
> > where some of them are also bind mounts that point to same dentry ?
>
> Without recursive semantics, you see the underlying inode. With them,
> you see the upper vfsmnts. Shiftfs isn't idempotent, so you would need
> to be careful about nesting. However, that's an admin problem.
>
> > Also, you create a totally new user namespace interface here! by
> > making your own new interface we just lose the notion of init_user_ns
> > and its children and mapping ?
>
> I don't quite understand this; the only use of the init_user_ns is the
> capable(CAP_SYS_ADMIN) in fill_super which is how only the real admin
> can mount at a shifted uid/gid. Otherwise, there's no need to see into
> the userns because filesystems see the kuid_t/kgid_t which is what I'm
> shifting.
>
> > I'm not sure of the implication of all this... your user namespace
> > mapping is not related at all to init_user_ns! it seems that it has
> > its own init_user_ns ? does a capable() check now on a shifted
> > filesystem relates to that and hence to your mapping or to the real
> > init_user_ns ?
>
> capable(CAP_SYS_ADMIN) == ns_capable(&init_user_ns, CAP_SYS_ADMIN)
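
(For reference, that equality holds by definition; capable() in
kernel/capability.c is essentially:)

/* kernel/capability.c, paraphrased */
bool capable(int cap)
{
        return ns_capable(&init_user_ns, cap);
}
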
>
> Or is there a misunderstanding here about how user namespaces work
> inside the kernel? The design is that the ID shift is done as you
> cross the kernel boundary, so a filesystem, being usually all in-kernel
> operating via the VFS interfaces, ideally never needs to make any
> from_kuid/make_kuid calls. However, there are ways filesystems can
> send data across the kernel/user boundary outside of the usual vfs
> interfaces (ioctls being the most usual one) so in that specific code,
> they have to do the kuid_t to uid_t changes themselves. Shiftfs never
> sends data to the user outside of the VFS so it never needs to do this
> and can operate entirely on kuid_ts.
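
(As an illustration of that boundary, a hypothetical ioctl helper would look
roughly like the sketch below; the function name is made up and nothing here
is taken from the patch set. The kuid_t-to-uid_t translation happens only at
this one point:)

#include <linux/fs.h>
#include <linux/cred.h>
#include <linux/uidgid.h>
#include <linux/uaccess.h>

/* Hypothetical ioctl helper, illustration only: the one place a filesystem
 * has to translate a kuid_t back into a userspace uid_t itself. */
static long fs_ioctl_get_owner(struct file *file, uid_t __user *argp)
{
        struct inode *inode = file_inode(file);
        /* inode->i_uid is a kuid_t; translate it for the caller's userns */
        uid_t uid = from_kuid_munged(current_user_ns(), inode->i_uid);

        return put_user(uid, argp) ? -EFAULT : 0;
}
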
>
> > > There's a bit of an open question of whether it should have vfs
> > > changes: the way the struct file f_inode and f_ops are hijacked is
> > > a bit nasty and perhaps d_select_inode() could be made a bit
> > > cleverer to help us here instead.
> >
> > I'm not sure if this PoC works... but you sure you didn't introduce
> > a serious vulnerability here ? you use a new mapping and you update
> > current_fsuid() creds up, which is global on any fs operation, so may
> > be: lets operate on any inode, update our current_fsuid()... and
> > access the rest of *unshifted filesystems*... !?
>
> The credentials are per thread, so it's a standard way of doing
> credential shifting and no other threads of execution in the same task
> get access. As long as you bound the override_creds()/revert_creds()
> pairs within the kernel, you're safe.
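
(For reference, the bounded pattern described here is the usual
prepare_creds()/override_creds()/revert_creds() sequence, roughly as in the
sketch below; the shifted id is a made-up example, not taken from the
patches:)

#include <linux/cred.h>
#include <linux/uidgid.h>
#include <linux/errno.h>

/* Sketch of a bounded credential override; only this task sees the
 * temporary creds, and they are reverted before returning. */
static int do_shifted_op(void)
{
        const struct cred *old;
        struct cred *shifted;
        int err = 0;

        shifted = prepare_creds();
        if (!shifted)
                return -ENOMEM;
        shifted->fsuid = make_kuid(&init_user_ns, 100000);  /* hypothetical */

        old = override_creds(shifted);
        /* ... perform the underlying filesystem operation ... */
        revert_creds(old);

        put_cred(shifted);
        return err;
}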

No, it is not safe... and sorry, here I mean shifted.

current_fsuid() is global across all fs operations, which means it crosses
user namespaces... that was safe back when only init_user_ns existed, not
anymore... You give fsuid a mapping inside containers where they don't want
to have one... this allows operating on inodes inside other containers...
current_fsuid() gets updated even if we want that user to be nobody inside
the container... and later it can access the inodes of the shifted fs...
from the same current, of course...
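
To make that concrete: the generic owner check compares global kuid_t values
directly, roughly as in the simplified sketch below (condensed from the
acl_permission_check() logic), so whatever kuid_t ends up in current_fsuid()
is matched against inode owners everywhere; the caller's /proc/self/uid_map
never enters this comparison:

#include <linux/fs.h>
#include <linux/cred.h>
#include <linux/uidgid.h>

/* Simplified sketch of the owner check behind generic_permission():
 * both sides are global kuid_t values, so the caller's per-namespace
 * uid_map plays no role at this point. */
static int owner_check(struct inode *inode, int mask)
{
        umode_t mode = inode->i_mode;

        if (uid_eq(current_fsuid(), inode->i_uid))
                mode >>= 6;     /* use the owner permission bits */

        return (mask & ~mode & (MAY_READ | MAY_WRITE | MAY_EXEC)) ? -EACCES : 0;
}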



> > The worst thing is that current_fsuid() does not follow now the
> > /proc/self/uid_map interface! this is a serious vulnerability and a
> > mix of the current semantics... it's updated but using other
> > rules...?
>
> current_fsuid() is already mapped via the userns; it's already a kuid_t
> at its final value. Shifting that is what you want to remap underlying
> volume uid/gid's. The uidmap/gidmap inputs to this are shifts on the
> final underlying uid/gids.

=> some points:
This changes setfsuid(), its interface and its rules... or it is an indirect
way to break another syscall...

The userns used for the *mapping* is totally different and not standard...
we lose the "init_user_ns and its descendant userns *semantics*", and end up
with a totally unlinked mapping...


It breaks current_uid(), current_euid() and current_fsuid(), which end up
mapped in *different* user namespaces... hence different values inside
namespaces... you can change your userns mapping, but that specific
current_fsuid() will always be remapped to some other value inside, even
if you don't want it...
It crosses user namespaces... uid and euid are remapped according to
/proc/self/uid_map, while fsuid is remapped according to this new interface...
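
For comparison, the translation that uid and euid keep following is the
/proc/self/uid_map one: an id X inside the namespace corresponds to
outside_start + (X - inside_start) when X falls in a mapped range. A minimal
userspace sketch of that lookup (error handling trimmed):

#include <stdio.h>

/* Translate an in-namespace uid to its outside value via /proc/self/uid_map,
 * i.e. the make_kuid()-style mapping that uid/euid follow. */
static long uid_map_outside(long inside_uid)
{
        FILE *f = fopen("/proc/self/uid_map", "r");
        long in, out, count;

        if (!f)
                return -1;
        while (fscanf(f, "%ld %ld %ld", &in, &out, &count) == 3) {
                if (inside_uid >= in && inside_uid < in + count) {
                        fclose(f);
                        return out + (inside_uid - in);
                }
        }
        fclose(f);
        return -1;      /* unmapped */
}

int main(void)
{
        printf("uid 0 inside maps to %ld outside\n", uid_map_outside(0));
        return 0;
}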

With the mapping hard coded, nested containers/apps may *share* an fsuid
and can't get rid of it: even if they change the inside userns mapping to
disable, split or reduce the mapped users, or to offer better isolation,
they can't... there is no way to make private inodes inside containers if
they share the final fsuid, since the inside-container mapping is ignored...

...

> the privileged ids down to 100000, but I have a volume which still has
> realids, I can mount that volume using shiftfs with
> uidmap=0:100000:1000 and it will allow this userns to read and write
> the volume through its remapped ids.
>
> > For overlayfs I did write an experiment but for me it's not an
> > overlayfs or another new filesystem problem... we are manipulating
> > UID/GID identities...
> >
> > It would have been better if you did send this as a separate thread.
> > It was a vfs:userns RFC fix which if we continue we turn it into a
> > complicated thing! implement another new light filesystem with
> > userns... (overlayfs...)
> >
> > Will follow up if the appropriate thread is created, not here, I
> > guess it's ok ?
>
> Well, I can resend the patch as a separate thread when I've fixed some
> of the problems viro pointed out.
>
> James
>
> > > James
> > >
> >
> > Thank you for your feedback!
> >
> >
>

Thanks!

--
Djalal Harouni
http://opendz.org