[RFC][PATCHSET] non-recursive link_path_walk() and reducing stack footprint

From: Al Viro
Date: Mon Apr 20 2015 - 14:12:31 EST


The patchset below gets rid of recursion in pathname resolution.
As the result, stack footprint of fs/namei.c stuff doesn't depend upon
the symlink nesting anymore and is much less than the worst cases in
mainline - as the matter of fact, it's around the footprint mainline gets
with just one or two levels of nesting.

We could reduce it further (see below), but I'm not sure it's worth
doing - it's not much extra complexity, but we only squeeze out ~250 bytes
that way, with the worst-case footprints (those are triggered by rename())
are around 1.4Kb (instead of about twice as much in mainline). OTOH,
that extra complexity would've allowed us to get rid of the nesting limit
entirely (i.e. we get ELOOP when we encounter 40 symlinks, no matter in
which manner they are nested). That might be worth considering...

Performance of the result seems to be about on par with the mainline
and so far it has survived everything I'd hit it with. More testing is
obviously needed.

Patches 2/24..6/24 are from Neil's RCU follow_link patchset; the
rest of his patchset is, of course, derailed by the massage done here,
but AFAICS we should be able to port it on top of this one with reasonably
little PITA.

There are several fs-visible interface changes:
1) nameidata is _not_ passed to ->follow_link() and ->put_link(). We use
the same kind of approach as with pt_regs.
2) instead of "store the link body (with nd_set_link()), return an opaque
pointer to be given to ->put_link() to undo whatever we'd done to pin the
link body" ->follow_link() now does the opposite - it *returns* the symlink
body and stores whatever opaque pointer ->put_link() will need. Stores
it with nd_pin_link(cookie). Rules:
* ERR_PTR() returned => fail lookup with that error.
* NULL returned => we'd done the move ourselves, no body to traverse.
* pointer to string => symlink body for traversal. ->put_link()
will be called if non-NULL pointer had been stored by nd_pin_link().
3) ->put_link() does *not* have access to symlink body anymore. The cases
that used to rely on seeing it (kfree_put_link(), mostly) are trivially
switched to new rules - just do nd_pin_link(body); return body; and we
are fine.

I've converted all in-tree instances of ->follow_link()/->put_link()
(see 23/24) and it's actually less headache than with the mainline calling
conventions.

The main reason the series is that long is that I'm trying to keep
all steps small - it *is* a serious rewrite and I want to do it in easily
verified steps.

Overall it's fairly straightforward - instead of getting a new
stack frame for each level of nesting, we add an explicit array just for
the stuff that needs to be preserved across the recursive call - there's
not much of it (how far did we get in pathname + stuff needed for ->put_link(),
i.e. cookie and struct path of symlink itself). That array is *not*
embedded into nameidata - only a pointer to it. That allows to reduce the
size of struct nameidata, which is especially sensitive in rename(), where
we have two of those. Old nd->saved_names[] is gone - not needed anymore.
Array is held in caller (or caller of caller) stack frame. We could
go for "just keep a couple of elements there, if we need more we'll allocate
from slab", but that doesn't buy us a lot unless we remove the limit on
nesting at the same time. I hadn't done that in this series; we can always
do that later, it would be a fairly isolated modification.

For RCU follow_link porting we'll probably need to replace struct
path of symlink with union of struct path and (dentry,inode) pair. Should
be doable without blowing the stack footprint.

The scariest remaining thing about fs/namei.c stack footprint is that
we get ->d_automount() called at pretty much maximal depth; it's about half of
what it used to be in mainline (1.3Kb instead of 2.7Kb), but there's a lot
of work done by that thing (and by finish_automount()). Something like
NFS referral point encountered by rename(2) while looking for parent
directories can get really nasty...

Folks, please review; the patches (on top of vfs.git#for-next) go in followups, the entire branch is in vfs.git#link_path_walk.

Shortlog:
Al Viro (19):
lustre: rip the private symlink nesting limit out
namei: expand nested_symlink() in its only caller
namei.c: separate the parts of follow_link() that find the link body
namei: fold follow_link() into link_path_walk()
link_path_walk: handle get_link() returning ERR_PTR() immediately
link_path_walk: don't bother with walk_component() after jumping link
link_path_walk: turn inner loop into explicit goto
link_path_walk: massage a bit more
link_path_walk: get rid of duplication
link_path_walk: final preparations to killing recursion
link_path_walk: kill the recursion
link_path_walk: split "return from recursive call" path
link_path_walk: cleanup - turn goto start; into continue;
namei: fold may_follow_link() into follow_link()
namei: introduce nameidata->stack
namei: regularize use of put_link() and follow_link(), trim arguments
namei: trim the arguments of get_link()
new ->follow_link() and ->put_link() calling conventions
uninline walk_component()

NeilBrown (5):
VFS: replace {, total_}link_count in task_struct with pointer to nameidata
ovl: rearrange ovl_follow_link to it doesn't need to call ->put_link
VFS: replace nameidata arg to ->put_link with a char*.
SECURITY: remove nameidata arg from inode_follow_link.
VFS: remove nameidata args from ->follow_link

Diffstat:
Documentation/filesystems/Locking | 4 +-
Documentation/filesystems/porting | 10 +
Documentation/filesystems/vfs.txt | 4 +-
drivers/staging/lustre/lustre/llite/symlink.c | 26 +-
fs/9p/vfs_inode.c | 15 +-
fs/9p/vfs_inode_dotl.c | 9 +-
fs/autofs4/symlink.c | 5 +-
fs/befs/linuxvfs.c | 46 ++--
fs/ceph/inode.c | 6 +-
fs/cifs/cifsfs.h | 2 +-
fs/cifs/link.c | 29 +-
fs/configfs/symlink.c | 28 +-
fs/debugfs/file.c | 5 +-
fs/ecryptfs/inode.c | 11 +-
fs/exofs/symlink.c | 7 +-
fs/ext2/symlink.c | 6 +-
fs/ext3/symlink.c | 6 +-
fs/ext4/symlink.c | 6 +-
fs/freevxfs/vxfs_immed.c | 11 +-
fs/fuse/dir.c | 19 +-
fs/gfs2/inode.c | 11 +-
fs/hostfs/hostfs_kern.c | 16 +-
fs/hppfs/hppfs.c | 9 +-
fs/jffs2/symlink.c | 9 +-
fs/jfs/symlink.c | 6 +-
fs/kernfs/symlink.c | 23 +-
fs/libfs.c | 7 +-
fs/namei.c | 378 +++++++++++++++-----------
fs/nfs/symlink.c | 18 +-
fs/ntfs/namei.c | 1 -
fs/overlayfs/inode.c | 36 +--
fs/proc/base.c | 4 +-
fs/proc/inode.c | 8 +-
fs/proc/namespaces.c | 4 +-
fs/proc/self.c | 24 +-
fs/proc/thread_self.c | 22 +-
fs/sysv/symlink.c | 5 +-
fs/ubifs/file.c | 7 +-
fs/ufs/symlink.c | 6 +-
fs/xfs/xfs_iops.c | 12 +-
include/linux/fs.h | 11 +-
include/linux/namei.h | 7 +-
include/linux/sched.h | 3 +-
include/linux/security.h | 9 +-
mm/shmem.c | 28 +-
security/capability.c | 3 +-
security/security.c | 4 +-
security/selinux/hooks.c | 2 +-
48 files changed, 457 insertions(+), 471 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/