Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

From: James Bottomley
Date: Mon Feb 06 2017 - 09:41:38 EST


On Mon, 2017-02-06 at 08:59 +0200, Amir Goldstein wrote:
> On Mon, Feb 6, 2017 at 3:18 AM, James Bottomley
> <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > On Sun, 2017-02-05 at 09:51 +0200, Amir Goldstein wrote:
> > > On Sat, Feb 4, 2017 at 9:19 PM, James Bottomley
> > > <James.Bottomley@xxxxxxxxxxxxxxxxxxxxx> wrote:
> > > > This allows any subtree to be uid/gid shifted and bound
> > > > elsewhere. It does this by operating similarly to overlayfs.
> > > > Its primary use is for shifting the underlying uids of
> > > > filesystems used to support unprivileged (uid shifted)
> > > > containers. The usual use case here is that the container is
> > > > operating with a uid shifted unprivileged root but sometimes
> > > > needs to make use of or work with a filesystem image that has
> > > > root at real uid 0.
> > > >
> > > > The mechanism is to allow any subordinate mount namespace to
> > > > mount a shiftfs filesystem (by marking it FS_USERNS_MOUNT) but
> > > > only allowing it to mount marked subtrees (using the -o mark
> > > > option as root). Once mounted, the subtree is mapped via the
> > > > super block user namespace so that the interior ids of the
> > > > mounting user namespace are the ids written to the filesystem.
> > > >
> > > > Signed-off-by: James Bottomley <
> > > > James.Bottomley@xxxxxxxxxxxxxxxxxxxxx>
> > > >
> > >
> > > James,
> > >
> > > Allow me to point out some problems in this patch and offer a
> > > slightly different approach.
> > >
> > > First of all, the subject says "uid/gid shifting bind mount", but
> > > it's not really a bind mount. What it is is a stackable mount,
> > > and two levels of stack, no less.
> >
> > The reason for the description is to have it behave exactly like a
> > bind mount. You can assert that a bind mount is, in fact, a
> > stacked mount, but we don't currently. I'm also not sure where you
> > get your 2 levels from?
> >
>
> A bind mount does not incur recursion into VFS code, a stacked fs
> does. And there is a programmable limit of stack depth of 2, which
> stacked filesystems need to comply with. Your proposed setup has 2
> stacked fs, the mark shiftfs by the admin and the uid shiftfs by the
> container user. Or maybe I misunderstood.

Oh, right, actually, it wouldn't be 2 because once the unprivileged
mount uses the marked filesystem, what it uses is the mnt and dentry
from the underlying filesystem (what you would have got from a path
lookup on it).

That said, it does perform recursive calls into the underlying
filesystem, unlike a true bind mount, so I can add the depth
accounting easily enough.

> > > So one thing that is missing is increasing of sb->s_stack_depth
> > > and that also means that shiftfs cannot be used to recursively
> > > shift uids in child userns if that was ever the intention.
> >
> > I can't think of a use case that would ever need that, but perhaps
> > other container people can.
> >
> > > The other problem is that by forking overlayfs functionality,
> >
> > So this wouldn't really be the right way to look at it: shiftfs
> > shares no code with overlayfs at all, so is definitely not a fork.
> > The only piece of functionality it has which is similar to
> > overlayfs is the way it does lookups via a new dentry cache.
> > However, that functionality is not unique to overlayfs and if you
> > look, you'll see that shiftfs_lookup() actually has far more in
> > common with ecryptfs_lookup().
>
> That's a good point. All stackable file systems may share similar
> problems and solutions (e.g. consistent st_ino/st_dev). Perhaps it
> calls for shared library code or more generic VFS code. At the moment
> ecryptfs is not seeing much development, so everything happens in
> overlayfs. If there is going to be more than 1 actively developed
> stackable fs, we need to see about that.

I believe we already do ... if you look at the lookup functions of each
of them, you see the only common thing is encapsulated in a variant of
the lookup_one_len() functions. After that, even simple things like
our negative dentry handling differs.

> > > shiftfs is going to miss out on overlayfs bug fixes related to
> > > user credentials differing from mounter credentials, like
> > > fd3220d ("ovl: update S_ISGID when setting posix ACLs"). I am
> > > not sure that this specific case is relevant to shiftfs, but
> > > there could be others.
> >
> > OK, so shiftfs doesn't have this bug and the reason why is
> > illustrative: basically shiftfs does three things
> >
> > 1. lookups via a uid/gid shifted dentry cache
> > 2. shifted credential inode operations permission checks on the
> > underlying filesystem
> > 3. location marking for unprivileged mount
> >
> > I think we've already seen that 1. isn't from overlayfs but the
> > functionality could be added to overlayfs, I suppose. The big
> > problem is 2. The overlayfs code emulates the permission checks,
> > which makes it rather complex (this is where you get your bugs like
> > the above from). I did actually look at adding 2. to overlayfs on
> > the theory that a single layer overlay might be closest to what
> > this is, but eventually concluded I'd have to take the special
> > cases and add a whole lot more to them ... it really would increase
> > the maintenance burden substantially and make the code an
> > unreadable rat's nest.
> >
>
> The use cases for uid shifting are still overwhelming for me.
> I take your word for it that it's going to be a maintenance burden
> to add this functionality to overlayfs.
>
> > When you think about it this way, it becomes obvious that the clean
> > separation is if shiftfs functionality is layered on top of
> > overlayfs and when you do that, doing it as its own filesystem is
> > more logical.
> >
>
> Yes, I agree with that statement. This is in line with the solution
> I outlined at the end of my previous email, where a single layer
> overlayfs is used for the host "mark" mount, although I wonder if
> the same cannot be achieved with a bind mount?

I understand, but once I can't consume overlayfs to construct it, the
idea of trying to use it becomes a negative, not a positive.

We could achieve the same thing with bind mounts if the vfsmount
structure carried a private field, but it doesn't. Given the
prevalence of this structure throughout the mount tree, I think that's
a deliberate decision to keep it thin.

> in host:
> mount -t overlay -o noexec,upper=<origin> container_visible <mark
> location>
>
> in container:
> mount -t shiftfs -o <mark location> <somewhere in my local mount ns>

So I'm not sure it's a more widespread problem: mount --bind is usable
inside an unprivileged container, which means you can bridge
filesystem subtrees even when you're only the local container admin.
The problem is mounting other filesystem types. Marking a type safe
for mounting is done with the FS_USERNS_MOUNT flag, but for things
like shiftfs it means you do have to restrict the source location.
For most filesystem types that source will be a device, so they will
need checking other than a mount mark.

James