[RFC][PATCHSET] non-recursive pathname resolution

From: Al Viro
Date: Tue May 05 2015 - 01:22:17 EST


My apologies for the patchbomb from hell, but I really think
it needs a public review.

There are several things in that queue; most of it is fs/namei.c
rewrite and closer to the end we get
* non-recursive link_path_walk()
* the only symlink-related limitation is that there should be no
more than 40 symlinks traversals, no matter how nested they are.
* stack footprint independent from the nesting and about on par with
mainline in "no symlinks" case.
* some work towards the symlink traversal without dropping from
RCU mode; not complete, but much better starting point than the current
mainline.
* IMO more readable logics around the pathname resolution in
general, but then I'd spent the last couple of weeks messing with that code,
so my perception might be seriously warped.

Most of that stuff is mine, with exception of 3 commits from
Neil's series (with some modifications). I've marked those with [N] in
the list below.

It really needs review and more testing; it *does* pass xfstests,
LTP and local tests with no regressions, but at the very least it'll need
careful profiling, as well as more eyes to read it through. The whole
thing is based at -rc1. Individual patches are in followups; those who
prefer to access it via git can fetch it at
git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git link_path_walk

Please, read and comment.

Roadmap:

* assorted preliminary patches:

[1/79] 9p: don't bother with 4K allocation for 24-byte local array...
[2/79] 9p: don't bother with __getname() in ->follow_link()
[3/79][N] ovl: rearrange ovl_follow_link to it doesn't need to call ->put_link
[4/79] ext4: split inode_operations for encrypted symlinks off the rest

* getting rid of boilerplate for inline symlinks

[5/79] libfs: simple_follow_link()
[6/79] ext2: use simple_follow_link()
[7/79] befs: switch to simple_follow_link()
[8/79] ext3: switch to simple_follow_link()
[9/79] ext4: switch to simple_follow_link()
[10/79] jffs2: switch to simple_follow_link()
[11/79] shmem: switch to simple_follow_link()
[12/79] debugfs: switch to simple_follow_link()
[13/79] ufs: switch to simple_follow_link()
[14/79] ubifs: switch to simple_follow_link()
[15/79] sysv: switch to simple_follow_link()
[16/79] jfs: switch to simple_follow_link()
[17/79] freevxfs: switch to simple_follow_link()
[18/79] exofs: switch to {simple,page}_symlink_inode_operations
[19/79] ceph: switch to simple_follow_link()

* more assorted preparations:
[20/79] logfs: fix a pagecache leak for symlinks
[21/79][N] SECURITY: remove nameidata arg from inode_follow_link.
[22/79] uninline walk_component()

* getting trailing symlinks handling in do_last() into saner shape:
[23/79] namei: take O_NOFOLLOW treatment into do_last()
[24/79] do_last: kill symlink_ok
[25/79] do_last: regularize the logics around following symlinks

* getting rid of nameidata on stack during ->rmdir()/->unlink() and double
nameidata during ->rename():
[26/79] namei: get rid of lookup_hash()
[27/79] name: shift nameidata down into user_path_walk()

* single instance of nameidata across the adjacent path_mountpoint() calls:
[28/79] namei: lift nameidata into filename_mountpoint()

At that point preparations end. The next one changes the calling conventions
for ->follow_link/->put_link. We still are passing nameidata to
->follow_link(), but almost all instances do not use it anymore. Moreover,
nd->saved_names[] is gone and so's nameidata in generic_readlink().
That one has the biggest footprint outside of fs/namei.c - it's a flagday
change.

[29/79] new ->follow_link() and ->put_link() calling conventions

* some preparations for link_path_walk() massage:
[30/79] namei.c: separate the parts of follow_link() that find the link body
[31/79] namei: don't bother with ->follow_link() if ->i_link is set
[32/79] namei: introduce nameidata->link
[33/79] do_last: move path there from caller's stack frame

* link_path_walk()/follow_link() rewrite. It's done in small steps,
getting the damn thing into form that would allow to get rid of recursion:

[34/79] namei: expand nested_symlink() in its only caller
[35/79] namei: expand the call of follow_link() in link_path_walk()
[36/79] namei: move the calls of may_follow_link() into follow_link()
[37/79] namei: rename follow_link to trailing_symlink, move it down
[38/79] link_path_walk: handle get_link() returning ERR_PTR() immediately
[39/79] link_path_walk: don't bother with walk_component() after jumping link
[40/79] link_path_walk: turn inner loop into explicit goto
[41/79] link_path_walk: massage a bit more
[42/79] link_path_walk: get rid of duplication
[43/79] link_path_walk: final preparations to killing recursion

* dumb and staightforward "put all variables we needed to be separate in
each frame into an array of structs" killing of recursion. Array is still
on stack frame of link_path_walk(), we are still with the same depth limits,
etc. All of that will change shortly afterwards.
[44/79] link_path_walk: kill the recursion

* post-operation cleanup:
[45/79] link_path_walk: split "return from recursive call" path
[46/79] link_path_walk: cleanup - turn goto start; into continue;

* array moves into nameidata, with extra (0th) element used instead of
link/cookie pairs of local variables we used to have in trailing symlink
loops:
[47/79] namei: move link/cookie pairs into nameidata

* more post-op cleanup - arguments of several functions are always the
fields of the same array element or pointers to such:
[48/79] namei: trim redundant arguments of trailing_symlink()
[49/79] namei: trim redundant arguments of fs/namei.c:put_link()
[50/79] namei: trim the arguments of get_link()

* trim the array in nameidata to a couple of elements, use it until we
need more and allocate a big (41-element) array when needed. Result:
no more nesting limitations.
[51/79] namei: remove restrictions on nesting depth

* the use of array is seriously counterintuitive - for everything except
the last component we are _not_ using the first element of array; moreover,
way too many places are manipulating nd->depth. I've tried to sanitize
all of that in one patch, but after several failures I ended up splitting
the massage into smaller steps. The composition of those has ended up
fairly small, but convincing yourself that it's correct will mean reporoducing
this splitup ;-/
[52/79] link_path_walk: nd->depth massage, part 1
[53/79] link_path_walk: nd->depth massage, part 2
[54/79] link_path_walk: nd->depth massage, part 3
[55/79] link_path_walk: nd->depth massage, part 4
[56/79] trailing_symlink: nd->depth massage, part 5
[57/79] get_link: nd->depth massage, part 6
[58/79] trailing_symlink: nd->depth massage, part 7
[59/79] put_link: nd->depth massage, part 8
[60/79] link_path_walk: nd->depth massage, part 9
[61/79] link_path_walk: nd->depth massage, part 10
[62/79] link_path_walk: end of nd->depth massage
[63/79] namei: we never need more than MAXSYMLINKS entries in nd->stack

* getting more regular rules for calling terminate_walk()/put_link().
The reason for that part is (besides getting simpler logics in the end) that
with RCU-symlink series we'll really need to have terminate_walk() trigger
put_link() on everything we have on stack:

[64/79] namei: lift (open-coded) terminate_walk() in follow_dotdot_rcu() into
callers
[65/79] lift terminate_walk() into callers of walk_component()
[66/79] namei: lift (open-coded) terminate_walk() into callers of get_link()
[67/79] namei: take put_link() into {lookup,mountpoint,do}_last()
[68/79] namei: have terminate_walk() do put_link() on everything left
[69/79] link_path_walk: move the OK: inside the loop

* more put_link() work - basically, we are consolidating the "we decide
that we'd found a symlink that needs to be followed" logics. Again,
that's one of the areas RCU-symlink will need to modify:

[70/79] namei: new calling conventions for walk_component()
[71/79] namei: make should_follow_link() store the link in nd->link
[72/79] namei: move link count check and stack allocation into pick_link()

At that point the parts of the core work I consider reasonably stable
are over. Next 4 commits are about not passing nameidata to ->follow_link() -
it's the same situation as with pt_regs and IRQ handlers and the same kind
of solution. Only 2 instances out of 26 use nameidata at all, and only
to pass it to nd_jump_link(); seeing that we already have the "begin
using this nameidata instance/end using it" pairs bracketing the areas
where thread works with given instance (we need that for sane freeing
of on-demand allocated nd->stack), the solution is obvious...

[73/79] lustre: rip the private symlink nesting limit out
[74/79][N] VFS: replace {, total_}link_count in task_struct with pointer to
nameidata
[75/79] namei: simplify the callers of follow_managed()
[76/79] don't pass nameidata to ->follow_link()

The rest is very preliminary bits and pieces for RCU-symlink; it's almost
certainly going to be replaced and can be pretty much ignored for now:

[77/79] namei: move unlazying on symlinks towards get_link() call sites
[78/79] namei: new helper - is_stale()
[79/79] namei: stretch RCU mode into get_link()
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/