Re: [PATCH] netns: dangling pointer on netns bind mount point.

From: Yun Levi
Date: Wed Apr 08 2020 - 01:59:44 EST


Thank you for great comments. Thanks to you I understand what i missed.

I try to generate problem on mainline But, as you explained that
situation isn't happen,

Maybe my other things which I made generate some problem (freeing
network namespace..)

Thanks for great answering and sharing.

If I meet the situation, at that time I'll share. Thank you very much!

P.S. If I have a question, Could I ask via e-mail like this?

On Wed, Apr 8, 2020 at 3:26 AM Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, Apr 07, 2020 at 09:53:29PM +0900, Yun Levi wrote:
> > BTW, It's my question.
> >
> > >Look: we call ns_get_path(), which calls ns_get_path_cb(), which
> > >calls the callback passed to it (ns_get_path_task(), in this case),
> > >which grabs a reference to ns. Then we pass that reference to
> > >__ns_get_path().
> > >
> > >__ns_get_path() looks for dentry stashed in ns. If there is one
> > >and it's still alive, we grab a reference to that dentry and drop
> > >the reference to ns - no new inodes had been created, so no new
> > >namespace references have appeared. Existing inode is pinned
> > >by dentry and dentry is pinned by _dentry_ reference we've got.
> >
> > actually ns_get_path is called in unshare(2).
>
> Yes, it does. Via perf_event_namespaces(), which does
> perf_fill_ns_link_info(&ns_link_info[NET_NS_INDEX],
> task, &netns_operations);
> and there we have
> error = ns_get_path(&ns_path, task, ns_ops);
> if (!error) {
> ns_inode = ns_path.dentry->d_inode;
> ns_link_info->dev = new_encode_dev(ns_inode->i_sb->s_dev);
> ns_link_info->ino = ns_inode->i_ino;
> path_put(&ns_path);
> }
> See that path_put()? Dentry reference is dropped by it.
>
> > and it makes new dentry and
> > inode in __ns_get_path finally (Cuz it create new network namespace)
> >
> > In that case, when I mount with --bind option to this proc/self/ns/net, it
> > only increase dentry refcount on do_loopback->clone_mnt finally (not call
> > netns_operation->get)
> > That means it's not increase previous created network namespace reference
> > count but only increase reference count of _dentry_
> >
> > In that situation, If I exit the child process it definitely frees the
> > net_namespace previous created at the same time it decrease net_namespace's
> > refcnt in exit_task_namespace().
> > It means it's possible that bind mount point can hold the dentry with inode
> > having net_namespace's dangling pointer in another process.
> > In above situation, parent who know that binded mount point calls setns(2)
> > then it sets the net_namespace which is refered by the inode of the dentry
> > increased by do_loopback.
> > That makes set the net_namespace which was freed already.
>
> How? Netns reference in inode contributes to netns refcount. And it's held
> for as long as the _inode_ has positive refcount - we only drop it from
> the inode destructor, *NOT* every time we drop a reference to inode.
> In the similar fashion, the inode reference in dentry contributes to inode
> refcount. And again, that inode reference won't be dropped until the _last_
> reference to dentry gets dropped.
>
> Incrementing refcount of dentry is enough to pin the inode and thus the
> netns the inode refers to. It's a very common pattern with refcounting;
> a useful way of thinking about it is to consider the refcount of e.g.
> inode as sum of several components, one of them being "number of struct
> dentry instances with ->d_inode pointing to our inode". And look at e.g.
> __ns_get_path() like this:
> rcu_read_lock();
> d = atomic_long_read(&ns->stashed);
> if (!d)
> goto slow;
> dentry = (struct dentry *)d;
> if (!lockref_get_not_dead(&dentry->d_lockref))
> goto slow;
> other_count(dentry) got incremented by 1.
> rcu_read_unlock();
> ns->ops->put(ns);
> other_count(ns) decremented by 1.
> got_it:
> path->mnt = mntget(mnt);
> path->dentry = dentry;
> path added to paths(dentry), other_count(dentry) decremented by 1 (getting
> it back to the original value).
> return 0;
> slow:
> rcu_read_unlock();
> inode = new_inode_pseudo(mnt->mnt_sb);
> if (!inode) {
> ns->ops->put(ns);
> subtract 1 from other_count(ns)
> return -ENOMEM;
> }
> dentries(inode) = empty
> other_count(inode) = 1
> ....
> inode->i_private = ns;
> add inode to inodes(ns), subtract 1 from other_count(ns); the total
> is unchanged.
> dentry = d_alloc_anon(mnt->mnt_sb);
> if (!dentry) {
> iput(inode);
> subtract 1 from other_count(inode). Since now all components of
> inode refcount are zero, inode gets destroyed. Destructor calls
> nsfs_evict_inode(), which removes the inode from inodes(ns).
> The total effect: inode is destroyed, inodes(ns) is back to what
> it used to be and other_count(ns) is left decremented by 1 compared
> to what we used to have. IOW, the balance is the same as if inode
> allocation would've failed.
> return -ENOMEM;
> }
> other_count(dentry) = 1
> d_instantiate(dentry, inode);
> add dentry to dentries(inode), subtract 1 from other_count(inode).
> The total is unchanged. Now other_count(inode) is 0 and dentries(inode)
> is {dentry}.
> d = atomic_long_cmpxchg(&ns->stashed, 0, (unsigned long)dentry);
> if (d) {
> somebody else has gotten there first
> d_delete(dentry); /* make sure ->d_prune() does nothing */
> dput(dentry);
> subtract 1 from other_count(dentry) (which will drive it to 0). Since
> no other references exist, dentry gets destroyed. Destructor will
> remove dentry from dentries(inode) and since other_count(inode) is already
> zero, trigger destruction of inode. That, in turn, will remove inode
> from inodes(ns). Total effect: dentry is destroyed, inode is destroyed,
> inodes(ns) is back to what it used to be, other_count(ns) is left decremented
> by 1 compared to what we used to have.
> cpu_relax();
> return -EAGAIN;
> }
> goto got_it;
> got_it:
> path->mnt = mntget(mnt);
> path->dentry = dentry;
> add path to paths(dentry), subtract 1 from other_count(dentry). At that
> point other_count(dentry) is back to 0, ditto for other_count(inode) and
> other_count(ns) is left decremented by 1 compared to what it used to be.
> inode is added to inodes(ns), dentry - to dentries(inode) and path - to
> paths(dentry).
> return 0;
> and we are done.
>
> In all cases the total effect is the same as far as "other" counts go:
> other_count(ns) is down by 1 and that's the only change in other_count()
> of *ANY* objects. Of course we do not keep track of the sets explicitly;
> it would cost too much and we only interested in the sum of their sizes
> anyway. What we actually store is the sum, so operations like "transfer
> the reference from one component to another" are not immediately obvious
> to be refcounting ones - the sum is unchanged. Conceptually, though,
> they are refcounting operations.
> Up to d_instantiate() we are holding a reference to inode;
> after that we are *not* - it has been transferred to dentry. That's
> why on subsequent failure exits we do not call iput() - the inode
> reference is not ours to discard anymore.
> In the same way, up to inode->i_private = ns; we are holding
> a reference to ns. After that we are not - it's been transferred to
> inode. From that point on it's not ours to discard; it will be
> dropped when inode gets destroyed, whenever that happens.