Re: [RFC] dealing with proc_ns_follow_link() and "namespace" dentries

From: Eric W. Biederman
Date: Mon Oct 13 2014 - 18:43:22 EST

Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> But more fundamentally, this stuff has no business being in procfs. The
> *only* reason why we are putting them there (and get those inodes and
> dentries duplicated for different procfs instances) is this in
> do_loopback():
> if (!check_mnt(parent) || !check_mnt(old))
> goto out2;
> You want to be able to bind those suckers by mount --bind. Which is an
> odd API design (and it gives you another headache from hell with mnt_ns_loop())
> but it's too late to change, sadly ;-/ So you want ->follow_link() to
> yield dentries matching some vfsmount in our namespace. The things would be
> much simpler if we didn't try that.
> Look: do_loopback() is *already* aware of those suckers. Has to be, due to
> aforementioned mnt_ns_loop(). So we might as well weaken the constraints on
> old - check_mnt(old) || proc_ns_inode(old_path.dentry->d_inode) would do.
> The thing is, we calculate that proc_ns_inode() anyway.

Truthfully mnt_ns_loop only exists for because we have these for the
mount namespace. But I take your point.

Your proposal sounds like it should work. After the bind mount is
created we have a new struct mount so the existing chk_mnt checks won't
get us into trouble.

At some point we may need to allow multiple internal superblocks to
allow having a preset inode number for the purposes of
checkpoint/restart but I don't see that being a bottleneck to process
migration anytime in the near future, and your design does not preclude
that possibility.

There was a potential issue with your earlier suggestion to modify
check_mnt to only include mounted filesystems as otherwise there is an
issue with references to filesystems unmounted with MNT_DETACH being
able to be bind mounted. Having an explicit proc_ns_inode test avoids
those issues.

> Now, suppose we'd done that. In principle, it allows some things that are
> not allowed right now - you could open such file and pass it (via SCM_RIGHTS)
> to a process in another namespace. With the current tree you can't bind
> that bugger via /proc/self/fd/<n> - it will have alien vfsmount. After such
> change it will become allowed. Is there any problem with that? I don't see
> any (after all, recepient might have held it open and anyone who could
> get to that sucker via /proc/<pid>/fd/<n> could've simply stopped the recepient
> and made it pass the damn thing to them - being able to ptrace is enough
> for that). Am I missing anything?

The only thing you can't do with a passed file descriptor today is to
mount it. Everything else is already doable in any context, and all
mounting does is give you the ability to keep the namespace alive.
Something you can already do by just keeping the file descriptor open.

These filedescriptors are guarded by directory permissions so if
/proc/<pid>/fd had more open permissions I might be concerned, but
as it stands only the user of the process possesors of the appropriate
capability can access that directory.

So I really think we are good. Thank you for making such a careful

> *IF* we can get away with such change, everything becomes much easier. We
> can add a filesystem just for those guys - one for the entire system. And
> kern_mount() it. Don't bother with register_filesystem() - no point showing
> it in the /proc/filesystems, etc. No need to abuse procfs inodes either.
> Or procfs icache, for that matter. All we need is a small standard piece
> embedded into each {mnt,net,pid,user,...}ns.
> Namely, let's stash a pointer to such dentry (if any) into that standard piece.
> Not a counting reference, of course, and used in a very limited way.
> * have ->d_prune() for those guys replace the stashed pointer with NULL.
> * on the "get dentry for ns" side,
> again:
> rcu_read_lock()
> d = atomic_long_read(stashed pointer)
> if (!d)
> goto slow;
> if (!lockref_get_not_dead(&d->d_lockref))
> goto slow;
> rcu_read_unlock();
> return d;
> slow:
> rcu_read_unlock();
> allocate new inode
> allocate new dentry
> d_instantiate()
> d = atomic_long_cmpxchg(stashed pointer, NULL, new dentry);
> if (d) {
> dput(new dentry);
> cpu_relax(); // might not be needed
> goto again;
> }
> return new dentry;
> I'd suggest having ->i_private point to that standard piece and storing
> ns_ops and proc_inum in there.

I think it is a shame that we don't have an equivalent of
d_materialise_unique that does thoe whole d_find_alias logic
non-directories. Sigh. But what should drive which functions to keep
are the real and needs of filesystems not my weird namespace case.

>From looking at your proposed code that doesn't look like it will be
any more difficult to maintain from the namespace side. So I have no
objects with moving the code in that direction.

> It's obviously a project for the next cycle and it'll require
> some cooperation between vfs and userns trees, but not too much of it -
> all we really need is a never-changed commit embedding that structure into
> all ...ns and passing _that_ to proc_{alloc,free}_inum() replacements
> Merged both into vfs.git #for-next and usern one. We can do that right
> after -rc1. Incidentally, it might make sense to move refcount into that
> common piece as well, but that's independent from the stuff above.

That sounds like a reasonable direction to go. I think your schedule
may be a tad optimistic time-wise, if for no other reason that code
reviews take time, but that plan should work.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at