Re: [RFC 1/1] shiftfs: uid/gid shifting bind mount

From: Vivek Goyal
Date: Wed Feb 15 2017 - 09:17:41 EST


On Tue, Feb 14, 2017 at 03:45:55PM -0800, James Bottomley wrote:
> On Tue, 2017-02-14 at 18:03 -0500, Vivek Goyal wrote:
> > On Sun, Feb 05, 2017 at 05:18:11PM -0800, James Bottomley wrote:
> >
> > [..]
> > > > shiftfs is going to miss out on overlayfs bug fixes related to
> > > > user
> > > > credentials differ from mounter credentials, like fd3220d ("ovl:
> > > > update S_ISGID when setting posix ACLs"). I am not sure that this
> > > > specific case is relevant to shiftfs, but there could be other.
> > >
> > > OK, so shiftfs doesn't have this bug and the reason why is
> > > illustrative: basically shiftfs does three things
> > >
> > > 1. lookups via a uid/gid shifted dentry cache
> > > 2. shifted credential inode operations permission checks on the
> > > underlying filesystem
> > > 3. location marking for unprivileged mount
> > >
> > > I think we've already seen that 1. isn't from overlayfs but the
> > > functionality could be added to overlayfs, I suppose. The big
> > > problem is 2. The overlayfs code emulates the permission checks,
> > > which makes it rather complex (this is where you get your bugs like
> > > the above from). I did actually look at adding 2. to overlayfs on
> > > the theory that a single layer overlay might be closest to what
> > > this is, but eventually concluded I'd have to take the special
> > > cases and add a whole lot more to them ... it really would increase
> > > the maintenance burden substantially and make the code an
> > > unreadable rats nest.
> >
> > Hi James,
> >
> > If we merge this functionality in overlayfs, then we could avoid
> > extra copy of dentry/inode and that might be a significant advantage.
>
> I made that argument to Viro originally when I tried to do all lookups
> via the underlying cache. In the end, it's 192 bytes per dentry and
> 584 per inode, all of which are reclaimable, so it's not much of an
> advantage and it is a great simplification to the code. In general if
> you have a natural separation, you should make the layers reflect it.

ok.

>
> My container use case doesn't use overlayfs currently, so to me it
> wouldn't provide any advantage whatsoever.

In Docker and other use cases, this will probably be used in conjunction
with overlayfs, since containers want to write data, and those writes
should not go back to the image dir but to a container-specific dir.

>
> > W.r.t permission checks, I am wondering will it make sense to do what
> > overlayfs is doing for shiftfs. That is permission is checked on
> > two inodes. We use creds of task for checking permission on
> > shiftfs/overlay inode and mounter's creds on real inode.
>
> The mounter's creds for overlay are usually admin ones, so it's local
> permission check asks should I? and the later one asks can I? (as in
> would my original admin creds allow this). In many ways, overlayfs is
> ignoring the fact that the underlying ->permissions() call might have
> failed for some good reason on the current creds. I don't think any
> serious trouble results from this but it strikes me as icky.

So we do call ->permission() of the underlying inode, but with the creds
of the mounter (as you noted). Because we don't call reali->permission()
with the creds of the task, this resulted in issues with disk quota: the
mounter had CAP_SYS_RESOURCE, so disk quota was being ignored. But that's
easily fixable by taking CAP_SYS_RESOURCE away from the mounter's creds if
the caller does not have CAP_SYS_RESOURCE.
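
A rough userspace sketch of that fix, not kernel code: the struct and the
model_* helpers are made up for illustration, and only the bit number of
CAP_SYS_RESOURCE (24) is taken from <linux/capability.h>. It copies the
mounter's creds and clears CAP_SYS_RESOURCE in the copy when the caller
lacks it, so quota would be enforced for that caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit number of CAP_SYS_RESOURCE in <linux/capability.h>. */
#define MODEL_CAP_SYS_RESOURCE 24

/* Hypothetical, stripped-down stand-in for struct cred. */
struct model_cred {
	uint64_t cap_effective;
};

static bool model_has_cap(const struct model_cred *c, int cap)
{
	return ((c->cap_effective >> cap) & 1u) != 0;
}

/*
 * Build the creds used for the check on the real inode: start from the
 * mounter's creds and drop CAP_SYS_RESOURCE unless the caller has it too.
 */
static struct model_cred model_override_cred(const struct model_cred *mounter,
					     const struct model_cred *caller)
{
	struct model_cred c = *mounter;

	if (!model_has_cap(caller, MODEL_CAP_SYS_RESOURCE))
		c.cap_effective &= ~(UINT64_C(1) << MODEL_CAP_SYS_RESOURCE);
	return c;
}
```

All other capabilities of the mounter are left untouched in this sketch;
only the quota-related one follows the caller.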

>
> > Given we have already shifted the uid/gid for shiftfs inode, I am
> > wondering that why can't we simply call
> > generic_permission(shiftfs_inode, mask) directly in the context of
> > caller. Something like..
> >
> > shiftfs_permission() {
> > 	err = generic_permission(inode, mask);
> > 	if (err)
> > 		return err;
> >
> > 	switch_to_mounter_creds;
> > 	err = inode_permission(reali, mask);
> > 	revert_creds();
> >
> > 	return err;
> > }
>
> Because if the reali->d_iop->permission exists, you should use it. You
> could argue shiftfs_permission should be
>
> 	if (iop->permission) {
> 		oldcred = shiftfs_new_creds(&newcred, inode->i_sb);
> 		err = iop->permission(reali, mask);
> 		shiftfs_old_creds(oldcred, &newcred);
> 	} else
> 		err = generic_permission(inode, mask);
>
> But really that's a small optimisation.

ok. I thought using the mounter's creds for the real inode checks would
probably do away with the need to modify the caller's user namespace in
shiftfs_get_up_creds().

	cred->fsuid = KUIDT_INIT(from_kuid(sb->s_user_ns, cred->fsuid));
	cred->fsgid = KGIDT_INIT(from_kgid(sb->s_user_ns, cred->fsgid));
	cred->user_ns = ssi->userns;

IIUC, we are shifting the caller's fsuid and fsgid into the caller's user
namespace, but at the same time using the user_ns of reali->i_sb->s_user_ns.
That feels a little twisted to me. Maybe I am misunderstanding it.
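
To make the shift concrete, here is a small userspace model of what
KUIDT_INIT(from_kuid(ns, kuid)) does to an id. It is only a sketch: it
assumes a namespace with a single uid_map extent, and the model_* names
and the example numbers are made up. The kernel id is translated into the
namespace-local id, and that raw number is then reused as a kernel id.

```c
#include <stdint.h>

/* Same value as (uid_t)-1, which from_kuid() returns for unmapped ids. */
#define MODEL_INVALID_ID ((uint32_t)-1)

/* One uid_map/gid_map extent: "first lower_first count". */
struct model_extent {
	uint32_t first;       /* first id as seen inside the namespace */
	uint32_t lower_first; /* first kernel id backing the range */
	uint32_t count;       /* length of the range */
};

/* Model of from_kuid()/from_kgid(): kernel id -> id inside the namespace. */
static uint32_t model_from_kid(const struct model_extent *e, uint32_t kid)
{
	if (kid >= e->lower_first && kid - e->lower_first < e->count)
		return e->first + (kid - e->lower_first);
	return MODEL_INVALID_ID;
}

/*
 * Model of KUIDT_INIT(from_kuid(sb->s_user_ns, fsuid)): the id seen inside
 * the namespace is reinterpreted as a kernel id for the underlying fs.
 */
static uint32_t model_shift_fsid(const struct model_extent *sb_ns_map,
				 uint32_t fsid)
{
	return model_from_kid(sb_ns_map, fsid);
}
```

With a map of "0 100000 65536", a caller whose fsuid is kernel id 101000
would end up acting as kernel id 1000 on the underlying filesystem, and an
id outside the map would become the invalid id.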

Two levels of checks would simplify this a bit. The top-level inode will
belong to the user namespace of the caller, and the checks should pass.
And the mounter's creds will have ownership over the real inode, so no
additional namespace shifting is required there. We could also save these
creds once at mount time and not have to prepare them for every call (not
sure whether that is a significant performance issue or not). Just
thinking aloud...
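
The two-level idea can be modelled in userspace roughly as follows. This
is a sketch under loud assumptions: model_permission() is a much
simplified stand-in for generic_permission() (owner/group/other bits plus
a single "DAC override" flag, no ACLs), and every model_* name and the
model_sb_info struct are hypothetical. The mounter's creds are stored once
in the per-superblock info instead of being prepared on each call.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_MAY_EXEC  1
#define MODEL_MAY_WRITE 2
#define MODEL_MAY_READ  4

struct model_cred {
	uint32_t fsuid;
	uint32_t fsgid;
	bool dac_override; /* crude stand-in for CAP_DAC_OVERRIDE */
};

struct model_inode {
	uint32_t uid;
	uint32_t gid;
	unsigned int mode; /* permission bits only, e.g. 0600 */
};

/* Simplified mode-bit check; returns 0 or -EACCES. */
static int model_permission(const struct model_inode *inode,
			    const struct model_cred *cred, int mask)
{
	unsigned int bits;

	if (cred->fsuid == inode->uid)
		bits = (inode->mode >> 6) & 7;
	else if (cred->fsgid == inode->gid)
		bits = (inode->mode >> 3) & 7;
	else
		bits = inode->mode & 7;

	if ((bits & (unsigned int)mask) == (unsigned int)mask)
		return 0;
	return cred->dac_override ? 0 : -EACCES;
}

/* Per-mount state: mounter creds captured once at mount time. */
struct model_sb_info {
	struct model_cred mounter_cred;
};

/* Level 1: caller vs. shifted inode. Level 2: mounter vs. real inode. */
static int model_two_level_permission(const struct model_sb_info *sbi,
				      const struct model_inode *upper,
				      const struct model_inode *real,
				      const struct model_cred *caller,
				      int mask)
{
	int err = model_permission(upper, caller, mask);

	if (err)
		return err;
	return model_permission(real, &sbi->mounter_cred, mask);
}
```

In this model the caller never needs its ids shifted for the second check,
because that check runs entirely with the saved mounter creds.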

Vivek