Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint

From: Al Viro
Date: Thu Apr 23 2015 - 14:08:11 EST


On Thu, Apr 23, 2015 at 05:45:44PM +1000, NeilBrown wrote:

> follow_link calls link_path_walk -> walk_component -> lookup_fast which sets
> nd->seq. Is that not enough? I guess not when nd_jump_link is called. Is
> that what I missed?

No. Potential nd_jump_link() callers are just refusing to go there in lazy
mode, end of story. That's not the problem; see below.

> One thing that is clear to me is that I don't really understand all the
> handling of 'seq' numbers, making me unable to comment usefully.
> I'll try to go through the current code and you current patch with that issue
> in mind and make sure I do understand it. Then I'll try to comment.

OK, here's the basic theory behind the lazy pathwalk:

* during the entire exercise we never drop rcu_read_lock, therefore any
objects that have all references to them removed before RCU-delayed
freeing (inodes, dentries, superblocks and vfsmounts are among such)
that we might find in process won't be freed until after we are done.

* invariant we are keeping:
at some point between the beginning of walk and now the pathname
traversed so far would lead to nd->path, with nd->seq and nd->inode equal to
->d_seq and ->d_inode of the dentry in question.

* path_init() arranges for that to be true in the beginning of the walk.

* symlinks aside, walk_component() preserves that.
+ normal name components go through lookup_fast(), where we have
__d_lookup_rcu() find a child of current nd->path with matching
name and (atomically) picks ->d_seq of that child, which had the
name matching our component. Atomicity is provided by ->d_lock
on child. Then we proceed to pick ->d_inode (and verify that
->d_seq has not changed, thus making sure that ->d_inode value
at the moment when the name matched had been the same and child
is still hashed. Then we verify that parent's ->d_seq has not
changed, guaranteeing that parent hadn't been moved or unhashed
from the moment we'd found it until after we'd found its child.
Assuming nothing's mounted on top of that thing, and there's no
problem with ->d_revalidate()), that's it - we have new nd->seq,
nd->path and nd->inode satisfying our invariant for longer
piece of pathname.
+ "." needs nothing - we just stay where we are
+ ".." is handled by follow_dotdot_rcu(), which in the normal case
(no mountpoint crossing) picks ->d_parent of where we are,
fetches its ->d_inode and ->d_seq and verifies that our directory
still hadn't changed _its_ ->d_seq. The last part guarantees that
it hadn't been moved since the time we'd found it, and thus its
->d_parent had remained unchanged at least until that verification.
Therefore, it remained pinned all along, and it ->d_inode had
remained stable, including the moment when we fetched ->d_seq.
Which means that the value we had picked *and* its ->d_inode and
->d_seq would satisfy the invariant for the longer piece of
pathname.
+ mountpoint crossing towards leaves is handled in __follow_mount_rcu();
it is simple (->mnt_root never changes and is always pinned,
stabilizing its ->d_inode), but we might need to worry about
automount points *and* need to make sure that we stop right
there if mount_lock has been bumped. See commit b37199e6 for
the details on the last part - basically, making sure that false
negatives from __lookup_mnt() won't end up with hard error when
we walk into whatever had been under the mountpoint we'd missed.
+ mountpoint crossing towards root (in follow_dotdot_rcu()) is
similar, but there we obviously don't care about automounts.
Looking at it now, it might make sense to recheck mount_lock there
as well, though - potential danger is to hit the moment when
mnt_parent and mnt_mountpoint are out of sync, leaving us with
mismatched vfsmount,dentry pair. Generally, that will be caught
when we try to leave lazy mode (legitimize_mnt() will fail) or
to cross towards leaves, but the next crossing towards root
might be bogus as well, and we could end up with unwarranted hard
error. Should be very hard to hit, but it's easy enough to check
*and* backport, so it looks like a good idea for -stable. Linus,
do you have any objections against adding
if (read_seqretry(&mount_lock, nd->m_seq))
goto failed;
right after
if (!follow_up_rcu(&nd->path))
break;
in follow_dotdot_rcu()? It's a very narrow race with mount --move
and most of the time it ends up being completely harmless, but
it's possible to construct a case when we'll get a bogus hard error
instead of falling back to non-lazy walk... OTOH, anyone doing
that will get something inherently timing-dependent as the result,
lazy mode or not. I'm in favour of adding that check, but IMO it's
not a critical problem.

* if we find that we can't continue in lazy mode because some verification
fails, we just fail with -ECHILD. However, in cases when our current position
might be fine, but the next step can't be done in lazy mode, we attempt to
fall back to non-lazy without full restart that would be caused by -ECHILD.
That's what unlazy_walk() is. Of course, if we reach the end of walk we
need to leave the lazy mode as well (complete_walk()). Either can fail,
and such a failure means restart from scratch in non-lazy mode. We need
to grab references on vfsmount and dentry (or two dentries, when we have
parent and child to deal with). The interesting part is vfsmount - we need
to make sure we won't interfere with umount(2), having walked in that sucker
*after* umount(2) has checked that it's not busy. See legitimize_mnt() for
details; basically, we have umount(2) mark the "I've verified it's not busy
and it's going to be killed no matter what" with MNT_SYNC_UMOUNT and rely
on RCU delays on that path - if we find one of those, we undo the
increment of its refcount we'd just done, without having dropped
rcu_read_lock(). Full-blown mntput() is done only on mismatches that
had _not_ been marked that way.

BTW, something similar to the above probably needs to be turned into coherent
text, either in parallel to Documentation/filesystems/path-lookup.txt, or
as update to it.

The reason why walk_component() in your call chain won't suffice is simple -
it will fail this
/*
* This sequence count validates that the parent had no
* changes while we did the lookup of the dentry above.
*
* The memory barrier in read_seqcount_begin of child is
* enough, we can use __read_seqcount_retry here.
*/
if (__read_seqcount_retry(&parent->d_seq, nd->seq))
return -ECHILD;
in lookup_fast(), just before updating nd->seq to new value. Basically,
it has no way to tell if parent has been buggered to hell and back.

It's not _that_ hard to prevent - we just need to stop discarding the parent's
seq number in "need to follow a symlink" case of walk_component(). Will take
some massage, but not too much...
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/