[RFC] dealing with proc_ns_follow_link() and "namespace" dentries

From: Al Viro
Date: Mon Oct 13 2014 - 02:30:00 EST

proc_ns_follow_link() is unique among the ->follow_link() instances
in several respects, some of them violating general asserts expected by
all kinds of VFS code.

First of all, it can lead to <vfsmount,dentry> pairs with dentry
being completely unrelated to vfsmount. Suppose we bind such a symlink over
something else. ->follow_link() is called with nd->path set to directory
we'd found the symlink in - after all, that's what the normal symlinks are
interpreted against. The same symlink present in two different directories
can resolve to different places. In this case we generate a dentry/inode in
procfs and combine that dentry with vfsmount of parent. Which might be
completely unrelated to procfs.

The thing is, procfs isn't a good match for those inodes either.
For one thing, they really don't give a damn which procfs instance are
they in - if anything, they should be associated with ns. For another,
using procfs icache for looking them up (by really strange inumbers)
solves the problem only partially. d_instantiate_unique(), used there in
attempt to keep dentries from breeding, does not do what you seem to expect
it to do. Namely, it *never* picks an existing alias. Why? Because it
looks for aliases with the same ->d_parent (and name) as the dentry we'd
given it. And that dentry comes from d_alloc_pseudo(), i.e. it has
itself for ->d_parent. *All* aliases for those inodes happen that way.
IOW, none of them have the same ->d_parent, and d_instantiate_unique()
here is simply a fancy way to spin through all existing aliases and call
d_instantiate() on our new one. Quick experiment:
int i;
for (i = 0; i < 100000; i++) {
open("/proc/self/ns/mnt", 0);
if (!(i % 1000))
system("grep dentry /proc/slabinfo");
run with ulimit -n 100000 and watch it create a new dentry per open().
Profile of that (sans system(3)) is also rather interesting -
each d_instantiate_unique() is O(N), so we spend a quadratic time in
that sucker for no good reason whatsoever.

If nothing else, d = d_find_any_alias(inode); if (d) {dput(dentry);return d;}
d_set_d_op(dentry, ...); d_instantiate(dentry, inode); return dentry; would be
a much better strategy - *this* would almost always find an alias when one is
to be found. We might occasionally get more than one (when several processes
race), but that beats the hell out of easily getting millions of them...

But more fundamentally, this stuff has no business being in procfs. The
*only* reason why we are putting them there (and get those inodes and
dentries duplicated for different procfs instances) is this in
if (!check_mnt(parent) || !check_mnt(old))
goto out2;
You want to be able to bind those suckers by mount --bind. Which is an
odd API design (and it gives you another headache from hell with mnt_ns_loop())
but it's too late to change, sadly ;-/ So you want ->follow_link() to
yield dentries matching some vfsmount in our namespace. The things would be
much simpler if we didn't try that.

Look: do_loopback() is *already* aware of those suckers. Has to be, due to
aforementioned mnt_ns_loop(). So we might as well weaken the constraints on
old - check_mnt(old) || proc_ns_inode(old_path.dentry->d_inode) would do.
The thing is, we calculate that proc_ns_inode() anyway.

Now, suppose we'd done that. In principle, it allows some things that are
not allowed right now - you could open such file and pass it (via SCM_RIGHTS)
to a process in another namespace. With the current tree you can't bind
that bugger via /proc/self/fd/<n> - it will have alien vfsmount. After such
change it will become allowed. Is there any problem with that? I don't see
any (after all, recepient might have held it open and anyone who could
get to that sucker via /proc/<pid>/fd/<n> could've simply stopped the recepient
and made it pass the damn thing to them - being able to ptrace is enough
for that). Am I missing anything?

*IF* we can get away with such change, everything becomes much easier. We
can add a filesystem just for those guys - one for the entire system. And
kern_mount() it. Don't bother with register_filesystem() - no point showing
it in the /proc/filesystems, etc. No need to abuse procfs inodes either.
Or procfs icache, for that matter. All we need is a small standard piece
embedded into each {mnt,net,pid,user,...}ns.

Namely, let's stash a pointer to such dentry (if any) into that standard piece.
Not a counting reference, of course, and used in a very limited way.
* have ->d_prune() for those guys replace the stashed pointer with NULL.
* on the "get dentry for ns" side,
d = atomic_long_read(stashed pointer)
if (!d)
goto slow;
if (!lockref_get_not_dead(&d->d_lockref))
goto slow;
return d;
allocate new inode
allocate new dentry
d = atomic_long_cmpxchg(stashed pointer, NULL, new dentry);
if (d) {
dput(new dentry);
cpu_relax(); // might not be needed
goto again;
return new dentry;
I'd suggest having ->i_private point to that standard piece and storing
ns_ops and proc_inum in there.

That way we get rid of weird <vfsmount, dentry> pairs, untangle that crap from
procfs, get rid of polluting icache with the stuff that has no reason to be
there and get rid of pointless aliases. Oh, and we get a decent chance to
kill d_instantiate_unique(), which is also nice. Or at least fold it into
d_add_unique(), if we can't kill that sucker as well. And if we manage to
kill it outright, we get rid of __d_instantiate_unique() for free - in my
local queue d_materialise_unique() had been subsumed by d_splice_alias(),
getting rid of the other caller of __d_instantiate_unique(). We have *WAY*
too many primitives in that area, and trimming that forest is definitely
a good thing.

Besides, that would allow to make struct nameidata completely opaque for
everything except fs/namei.c (by making nd_set_link() and nd_get_link()
exported instead of inlines). The only obstacle to that is that damn
->follow_link() instance with its fondling of nd->path.mnt. Would be nice
to get rid of that source of headache as well...

It's obviously a project for the next cycle and it'll require
some cooperation between vfs and userns trees, but not too much of it -
all we really need is a never-changed commit embedding that structure into
all ...ns and passing _that_ to proc_{alloc,free}_inum() replacements
Merged both into vfs.git #for-next and usern one. We can do that right
after -rc1. Incidentally, it might make sense to move refcount into that
common piece as well, but that's independent from the stuff above.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/