Re: [RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint

From: NeilBrown
Date: Thu Apr 23 2015 - 03:45:59 EST


On Wed, 22 Apr 2015 22:05:53 +0100 Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:

> On Wed, Apr 22, 2015 at 09:12:38PM +0100, Al Viro wrote:
> > On Wed, Apr 22, 2015 at 07:07:59PM +0100, Al Viro wrote:
> > > And one more: may_follow_link() is now potentially oopsable. Look: suppose
> > > we've reached the link in RCU mode, just as it got unlinked. link->dentry
> > > has become negative and may_follow_link() steps into
> > > /* Allowed if owner and follower match. */
> > > inode = link->dentry->d_inode;
> > > if (uid_eq(current_cred()->fsuid, inode->i_uid))
> > > return 0;
> > > Oops... Incidentally, I suspect that your __read_seqcount_retry() in
> > > follow_link() might be lacking a barrier; why isn't full read_seqcount_retry()
> > > needed?

Blind copy-paste error I suspect.

> > >
> > > FWIW, I would rather fetch ->d_inode *and* checked ->seq proir to calling
> > > get_link(), and passed inode to it as an explicit argument. And passed it
> > > to may_follow_link() as well...
> >
> > Hrm... You know, something really weird is going on here. Where are
> > you setting nd->seq? I don't see anything in follow_link() doing that.

follow_link calls link_path_walk -> walk_component -> lookup_fast which sets
nd->seq. Is that not enough? I guess not when nd_jump_link is called. Is
that what I missed?


> > And nd->seq _definitely_ needs setting if you want to stay in RCU mode -
> > at that point it matches the dentry of symlink, not that of nd->path
> > (== parent directory). Neil, could you tell me which kernel you'd been
> > testing (ideally - commit ID is a public git tree), what config and what
> > tests had those been?

The 'devel' branch of git://neil.brown.name/md, which is based on 4.0-rc7.

Tests have been fairly causal so far, making sure I can follow symlinks on a
small variety of filesystems, and running a particular test which repeatedly
stats a large number of non-existent files with paths that go through a
symlink.


>
> FWIW, there's a wart that had been annoying me for quite a while, and it
> might be related to dealing with that properly. Namely, walk_component()
> calling conventions. We have walk_component(nd, &path, follow), which can
> * return -E..., and leave us with pathwalk terminated; path contains
> junk, and so does nd->path.
> * return 0, update nd->path, nd->inode and nd->seq. The contents
> of path is in undefined state - it might be unchanged, it might be equal to
> nd->path (and not pinned down, RCU mode or not). In any case, callers do
> not touch it afterwards. That's the normal case.
> * return 1, update nd->seq, leave nd->path and nd->inode unchanged and
> set path pointing to our symlink. nd->seq matches path, not nd->path.
>
> In all cases the original contents of path is ignored - it's purely 'out'
> parameter, but compiler can't establish that on its own; it _might_ be
> left untouched. In all cases when its contents survives we don't look at
> it afterwards, but proving that requires a non-trivial analysis.
>
> And in case when we return 1 (== symlink to be followed), we bugger nd->seq.
> It's left as we need it for unlazy_walk() (and after unlazy_walk() we don't
> care about it at all), so currently everything works, but if we want to
> stay in RCU mode for symlink traversal, we _will_ need ->d_seq of parent
> directory.
>
> I wonder if the right way to solve that would be to drop the path argument
> entirely and store the bugger in nameidata. As in
> union {
> struct qstr last;
> struct path link;
> };
> ...
> union {
> int last_type;
> unsigned link_seq;
> };
> in struct nameidata. We never need both at the same time; after
> walk_component() (or its analogue in do_last()) we don't need the component
> name anymore. That way walk_component() would not trash nd->seq when
> finding a symlink...
>
> It would also shrink the stack footprint a bit - local struct path next
> in link_path_walk() would be gone, along with the same thing in path_lookupat()
> and friends. Not a lot of win (4 pointers total), but it might be enough
> to excuse putting ->d_seq of root in there, along with ->link.dentry->d_inode,
> to avoid rechecking its ->d_seq. As the matter of fact, we do have an
> odd number of 32bit fields in there, so ->d_seq of root would fit nicely...
>
> Comments?

One thing that is clear to me is that I don't really understand all the
handling of 'seq' numbers, making me unable to comment usefully.
I'll try to go through the current code and you current patch with that issue
in mind and make sure I do understand it. Then I'll try to comment.

Meanwhile, I like your suggestion in separate email to legitimize the whole
symlink stack when unlazy_walk is needed. One reason that I didn't even try
that originally is that it was not practical to walk the stack to find
everything that needed legitimizing. Now that you have that in an explicit
stack it is at least work looking into.

Thanks,
NeilBrown

Attachment: pgpmUGNVur2X6.pgp
Description: OpenPGP digital signature