Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint

From: NeilBrown
Date: Fri Apr 24 2015 - 02:35:51 EST


On Thu, 23 Apr 2015 19:07:55 +0100 Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

> On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:
>
> > follow_link calls link_path_walk -> walk_component -> lookup_fast which sets
> > nd->seq. Is that not enough? I guess not when nd_jump_link is called. Is
> > that what I missed?
>
> No. Potential nd_jump_link() callers are just refusing to go there in lazy
> mode, end of story. That's not the problem; see below.
>
> > One thing that is clear to me is that I don't really understand all the
> > handling of 'seq' numbers, making me unable to comment usefully.
> > I'll try to go through the current code and you current patch with that issue
> > in mind and make sure I do understand it. Then I'll try to comment.
>
> OK, here's the basic theory behind the lazy pathwalk:
>
> * during the entire exercise we never drop rcu_read_lock, therefore any
> objects that have all references to them removed before RCU-delayed
> freeing (inodes, dentries, superblocks and vfsmounts are among such)
> that we might find in process won't be freed until after we are done.
>
> * invariant we are keeping:
> at some point between the beginning of walk and now the pathname
> traversed so far would lead to nd->path, with nd->seq and nd->inode equal to
> ->d_seq and ->d_inode of the dentry in question.

Thanks for all of this!
I think one part that was confusing me is that there is a place where nd->seq
does related to nd->path.dentry.
Usually the two are updated at the same time.
However updated nd->seq, but *doesn't* update nd->path.dentry.
Instead, the dentry is stored in a separate 'path' variable.
This discrepancy is eventually resolved in a call to path_to_nameidata().

Which is a bit confusing until you know what is happening :-)


>
> * path_init() arranges for that to be true in the beginning of the walk.
>
> * symlinks aside, walk_component() preserves that.
> + normal name components go through lookup_fast(), where we have
> __d_lookup_rcu() find a child of current nd->path with matching
> name and (atomically) picks ->d_seq of that child, which had the
> name matching our component. Atomicity is provided by ->d_lock
> on child. Then we proceed to pick ->d_inode (and verify that

I don't see d_lock being taken in __d_lookup_rcu.
I think this atomicity is provided by ->d_seq on the child.


> ->d_seq has not changed, thus making sure that ->d_inode value
> at the moment when the name matched had been the same and child
> is still hashed. Then we verify that parent's ->d_seq has not
> changed, guaranteeing that parent hadn't been moved or unhashed
> from the moment we'd found it until after we'd found its child.
> Assuming nothing's mounted on top of that thing, and there's no
> problem with ->d_revalidate()), that's it - we have new nd->seq,
> nd->path and nd->inode satisfying our invariant for longer
> piece of pathname.
> + "." needs nothing - we just stay where we are
> + ".." is handled by follow_dotdot_rcu(), which in the normal case
> (no mountpoint crossing) picks ->d_parent of where we are,
> fetches its ->d_inode and ->d_seq and verifies that our directory
> still hadn't changed _its_ ->d_seq. The last part guarantees that
> it hadn't been moved since the time we'd found it, and thus its
> ->d_parent had remained unchanged at least until that verification.
> Therefore, it remained pinned all along, and it ->d_inode had
> remained stable, including the moment when we fetched ->d_seq.
> Which means that the value we had picked *and* its ->d_inode and
> ->d_seq would satisfy the invariant for the longer piece of
> pathname.
> + mountpoint crossing towards leaves is handled in __follow_mount_rcu();
> it is simple (->mnt_root never changes and is always pinned,
> stabilizing its ->d_inode), but we might need to worry about
> automount points *and* need to make sure that we stop right
> there if mount_lock has been bumped. See commit b37199e6 for
> the details on the last part - basically, making sure that false
> negatives from __lookup_mnt() won't end up with hard error when
> we walk into whatever had been under the mountpoint we'd missed.
> + mountpoint crossing towards root (in follow_dotdot_rcu()) is
> similar, but there we obviously don't care about automounts.
> Looking at it now, it might make sense to recheck mount_lock there
> as well, though - potential danger is to hit the moment when
> mnt_parent and mnt_mountpoint are out of sync, leaving us with
> mismatched vfsmount,dentry pair. Generally, that will be caught
> when we try to leave lazy mode (legitimize_mnt() will fail) or
> to cross towards leaves, but the next crossing towards root
> might be bogus as well, and we could end up with unwarranted hard
> error. Should be very hard to hit, but it's easy enough to check
> *and* backport, so it looks like a good idea for -stable. Linus,
> do you have any objections against adding
> if (read_seqretry(&mount_lock, nd->m_seq))
> goto failed;
> right after
> if (!follow_up_rcu(&nd->path))
> break;
> in follow_dotdot_rcu()? It's a very narrow race with mount --move
> and most of the time it ends up being completely harmless, but
> it's possible to construct a case when we'll get a bogus hard error
> instead of falling back to non-lazy walk... OTOH, anyone doing
> that will get something inherently timing-dependent as the result,
> lazy mode or not. I'm in favour of adding that check, but IMO it's
> not a critical problem.
>
> * if we find that we can't continue in lazy mode because some verification
> fails, we just fail with -ECHILD. However, in cases when our current position
> might be fine, but the next step can't be done in lazy mode, we attempt to
> fall back to non-lazy without full restart that would be caused by -ECHILD.
> That's what unlazy_walk() is. Of course, if we reach the end of walk we
> need to leave the lazy mode as well (complete_walk()). Either can fail,
> and such a failure means restart from scratch in non-lazy mode. We need
> to grab references on vfsmount and dentry (or two dentries, when we have
> parent and child to deal with). The interesting part is vfsmount - we need
> to make sure we won't interfere with umount(2), having walked in that sucker
> *after* umount(2) has checked that it's not busy. See legitimize_mnt() for
> details; basically, we have umount(2) mark the "I've verified it's not busy
> and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
> on RCU delays on that path - if we find one of those, we undo the
> increment of its refcount we'd just done, without having dropped
> rcu_read_lock(). Full-blown mntput() is done only on mismatches that
> had _not_ been marked that way.
>
> BTW, something similar to the above probably needs to be turned into coherent
> text, either in parallel to Documentation/filesystems/path-lookup.txt, or
> as update to it.

That might be fun.... if I find time.


>
> The reason why walk_component() in your call chain won't suffice is simple -
> it will fail this
> /*
> * This sequence count validates that the parent had no
> * changes while we did the lookup of the dentry above.
> *
> * The memory barrier in read_seqcount_begin of child is
> * enough, we can use __read_seqcount_retry here.
> */
> if (__read_seqcount_retry(&parent->d_seq, nd->seq))
> return -ECHILD;
> in lookup_fast(), just before updating nd->seq to new value. Basically,
> it has no way to tell if parent has been buggered to hell and back.
>
> It's not _that_ hard to prevent - we just need to stop discarding the parent's
> seq number in "need to follow a symlink" case of walk_component(). Will take
> some massage, but not too much...

So.....
Where I have:

+ if (nd->flags & LOOKUP_RCU) {
+ if (!nd->root.mnt)
+ set_root_rcu(nd);
+ nd->path = nd->root;


in the case where the symlink starts '/', I need to set nd->seq

nd->seq = read_seqcount_begin(&nd->path.dentry->d_seq);


but in the case where symlink doesn't start '/', I don't change nd->path,
so nd->seq should still be correct?

Also, just after that (still in follow_link()), we have

nd->inode = nd->path.dentry->d_inode;

*after* the test for (*s == '/').
Wouldn't is be OK to have that *inside* the if? In the relative-symlink case
it should be unchanged, and in the jump-symlink case nd_jump_link() has already
done that (and probably NULL was returns for 's' so this code doesn't execute).

So following on top of my previous patches?

Thanks heaps,
NeilBrown


diff --git a/fs/namei.c b/fs/namei.c
index d13b4315447f..ce6387d5317c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -947,6 +947,8 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
if (!nd->root.mnt)
set_root_rcu(nd);
nd->path = nd->root;
+ nd->seq = read_seqcount_begin(
+ &nd->path->dentry->d_seq);
} else {
if (!nd->root.mnt)
set_root(nd);
@@ -954,9 +956,9 @@ follow_link(struct path *link, struct nameidata *nd, void **p)
nd->path = nd->root;
path_get(&nd->root);
}
+ nd->inode = nd->path.dentry->d_inode;
nd->flags |= LOOKUP_JUMPED;
}
- nd->inode = nd->path.dentry->d_inode;
error = link_path_walk(s, nd);
if (unlikely(error))
put_link(nd, link, inode, *p);

Attachment: pgpnPFfb7Coyu.pgp
Description: OpenPGP digital signature