Re: [PATCH] Documentation: add new description of path-name lookup.

From: Randy Dunlap
Date: Sun Jul 26 2015 - 23:42:09 EST


On 07/24/15 17:28, NeilBrown wrote:
>

Hi Neil,

Some edits for you to consider.


>
> diff --git a/Documentation/filesystems/path-lookup.md b/Documentation/filesystems/path-lookup.md
> new file mode 100644
> index 000000000000..eedf31c737ed
> --- /dev/null
> +++ b/Documentation/filesystems/path-lookup.md
> @@ -0,0 +1,1297 @@
> +<head>
> +<style> p { max-width:50em} ol, ul {max-width: 40em}</style>
> +</head>
> +
> +Pathname lookup in Linux.
> +=========================
> +
> +This write-up is based on three articles published at lwn.net:
> +
> +- <https://lwn.net/Articles/649115/> Pathname lookup in Linux
> +- <https://lwn.net/Articles/649729/> RCU-walk: faster pathname lookup in Linux
> +- <https://lwn.net/Articles/650786/> A walk among the symlinks
> +
> +Written by Neil Brown with help from Al Viro and Jon Corbet.
> +
> +Introduction
> +------------


> +Bringing it together with `struct nameidata`
> +--------------------------------------------
> +

> +### `struct path root` ###
> +
> +This is used to hold a reference to the effective root of the
> +filesystem. Often that reference won't be needed, so this field is
> +only assigned the first time it is used, or when a non-standard root
> +is requested. Keeping a reference in the `nameidata` ensures that
> +only one root is in effect for the entire path walk, even if it races
> +with a `chroot()` system call.
> +
> +The root is needed when either of two conditions holds: (1) either the
> +pathname or a symbolic link starts with a "'/'", or (2) a "`..`"
> +component is being handled, since "`..`" from the root must always stay
> +at the root. The value used is usually the current root directory of
> +the calling process. An alternate root can be provided as when
> +`sysctl()` calls `file_open_root()`, and when NFSv4 or Btrfs call
> +`mount_subtree()`. In each case a pathname is being looked up in a very
> +specific part of the filesystem, and the lookup must not be allowed to
> +escape that subtree. It works a bit like a local `chroot()`.
> +
> +Ignoring the handling of symbolic links, we can now describe the
> +"`link_path_walk()`" function, which handles the lookup of everything
> +except the final component as:
> +
> +> Given a path (`name`) and a nameidata structure (`nd`), check that the
> +> current directory has execute permission and then advance `name`
> +> over one component while updating `last_type` and `last`. If that
> +> was the final component, then return, otherwise call
> +> `walk_component()` and repeat from the top.
> +
> +`walk_component()` is even easier. If the component is `LAST_DOTS`,
> +it calls `handle_dots()` which does the necessary locking as already
> +described. If it finds a `LAST_NORM` component it first calls
> +"`lookup_fast()`" which only looks in the dcache, but will ask the
> +filesystem to revalidate the result if it is that sort of filesystem.
> +If that doesn't get a good result, it calls "`lookup_slow()`" which
> +takes the `i_mutex`, rechecks the cache, and then asks the filesystem
> +to find a definitive answer. Each of these will call
> +`follow_managed()` (as described below) to handle any mount points.
> +
> +In the absence of symbolic links, `walk_component()` creates a new
> +`struct path` containing a counted reference to the new dentry and a
> +reference to the new `vfsmount` which is only counted if it is
> +different from the previous `vfsmount`. It then calls
> +`path_to_nameidata()` to install the new `struct path` in the
> +`struct nameidate` and drop the unneeded references.

nameidata

> +
> +This "hand-over-hand" sequencing of getting a reference to the new
> +dentry before dropping the reference to the previous dentry may
> +seem obvious, but is worth pointing out so that we will recognize its
> +analogue in the "RCU-walk" version.
> +
> +Handling the final component.
> +-----------------------------
> +
> +`link_path_walk()` only walks as far as setting `nd->last` and
> +`nd->last_type` to refer to the final component of the path. It does
> +not call `walk_component()` that last time. Handling that final
> +component remains for the caller to sort out. Those callers are
> +`path_lookupat()`, `path_parentat()`, `path_mountpoint()` and
> +`path_openat()` each of which handles the differing requirements of
> +different system calls.
> +
> +`path_parentat()` is clearly the simplest - it just wraps a little bit
> +of housekeeping around `link_path_walk()` and returns the parent
> +directory and final component to the caller. The caller will be either
> +aiming to create a name (via `filename_create()`) or remove or rename
> +a name (in which case `user_path_parent()` is used). They will use
> +`i_mutex` to exclude other changes while they validate and then
> +perform their operation.
> +
> +`path_lookupat()` is nearly as simple - it is used when an existing
> +object is wanted such as by `stat()` or `chmod()`. It essentially just
> +calls `walk_component()` on the final component through a call to
> +`lookup_last()`. `path_lookupat()` returns just the final dentry.
> +
> +`path_mountpoint()` handles the special case of unmounting which must
> +not try to revalidate the mounted filesystem. It effectively
> +contains, through a call to `mountpoint_last()`, an alternate
> +implementation of `lookup_slow()` which skips that step. This is
> +important when unmounting a filesystem that is inaccessible, such as
> +one provided by a dead NFS server.
> +
> +Finally `path_openat()` is used for the `open()` system call; it
> +contains, in support functions starting with "`do_last()`", all the
> +complexity needed to handle the different subtleties of O_CREAT (with
> +or without O_EXCL) final "`/`" characters, and trailing symbolic

missing comma? , final

> +links. We will revisit this in the final part of this series, which
> +focuses on those symbolic links. "`do_last()`" will sometimes, but
> +not always, take `i_mutex`, depending on what it finds.
> +
> +Each of these, or the functions which call them, need to be alert to
> +the possibility that the final component is not `LAST_NORM`. If the
> +goal of the lookup is to create something, then any value for
> +`last_type` other than `LAST_NORM` will result in an error. For
> +example if `path_parentat()` reports `LAST_DOTDOT`, then the caller
> +won't try to create that name. They also check for trailing slashes
> +by testing `last.name[last.len]`. If there is any character beyond
> +the final component, it must be a trailing slash.
> +
> +Revalidation and automounts
> +---------------------------
> +
> +Apart from symbolic links, there are only two parts of the "REF-walk"
> +process not yet covered. One is the handling of stale cache entries
> +and the other is automounts.
> +
> +On filesystems that require it, the lookup routines will call the
> +`->d_revalidate()` dentry method to ensure that the cached information
> +is current. This will often confirm validity or update a few details
> +from a server. In some cases it may find that there has been change
> +further up the path and that something that was thought to be valid
> +previously isn't really. When this happens the lookup of the whole
> +path is aborted and retried with the "`LOOKUP_REVAL`" flag set. This
> +forces revalidation to be more thorough. We will see more details of
> +this retry process in the next article.
> +
> +Automount points are locations in the filesystem where an attempt to
> +lookup a name can trigger changes to how that lookup should be
> +handled, in particular by mounting a filesystem there. These are
> +covered in greater detail in autofs4.txt in the Linux documentation
> +tree, but a few notes specifically related to path lookup are in order
> +here.
> +
> +The Linux VFS has a concept of "managed" dentries which is reflected
> +in function names such as "`follow_managed()`". There are three
> +potentially interesting things about these dentries corresponding
> +to three different flags that might be set in `dentry->d_flags`:
> +


> +RCU-walk - faster pathname lookup in Linux
> +==========================================
> +
> +RCU-walk is another algorithm for performing pathname lookup in Linux.
> +It is in many ways similar to REF-walk and the two share quite a bit
> +of code. The significant difference in RCU-walk is how it allows for
> +the possibility of concurrent access.
> +
> +We noted that REF-walk is complex because there are numerous details
> +and special cases. RCU-walk reduces this complexity by simply
> +refusing to handle a number of cases -- it instead falls back to
> +REF-walk. The difficulty with RCU-walk comes from a different
> +direction: unfamiliarity. The locking rules when depending on RCU are
> +quite different to traditional locking, so we will spend a little extra

I would say: from

> +time when we come to those.
> +
> +Clear demarcation of roles
> +--------------------------
> +
> +The easiest way to manage concurrency is to forcibly stop any other
> +thread from changing the data structures that a given thread is
> +looking at. In cases where no other thread would even think of
> +changing the data and lots of different threads want to read at the
> +same time, this can be very costly. Even when using locks that permit
> +multiple concurrent readers, the simple act of updating the count of
> +the number of current readers can impose an unwanted cost. So the
> +goal when reading a shared data structure that no other process is
> +changing, is to avoid writing anything to memory at all. Take no

dubious comma^

> +locks, increment no counts, leave no footprints.
> +

> +RCU and seqlocks: fast and light
> +--------------------------------
> +

> +However, there is a little bit more to seqlocks than that. If
> +RCU-walk accesses two different fields in a seqlock-protected
> +structure, or accesses the same field twice, there is no a-priori

no hyphen: ^

> +guarantee of any consistency between those accesses. When consistency
> +is needed - which it usually is - RCU-walk must take a copy and then
> +use `read_seqcount_retry()` to validate that copy.
> +


> +`unlazy walk()` and `complete_walk()`
> +-------------------------------------
> +

> +A pair of patterns
> +------------------
> +
> +In various places in the details of REF-walk and RCU-walk, and also in
> +the big picture, there are a couple of related patterns that are worth
> +being aware of.
> +
> +The first is "try quickly and check, if that fails try slowly". We
> +can see that in the high-level approach of first trying RCU-walk and
> +then trying REF-walk, and in places were `unlazy_walk()` is used to

where

> +switch to REF-walk for the rest of the path. We also saw it earlier
> +in `dget_parent()` when following a "`..`" link. It tries a quick way
> +to get a reference, then falls back to taking locks if needed.
> +

> +A walk among the symlinks
> +=========================
> +

> +
> +Storage and lifetime of cached symlinks
> +---------------------------------------
> +

> +
> +When neither of these are suitable, the next most likely scenario is

is

> +that the filesystem will allocate some temporary memory and copy or
> +construct the symlink content into that memory whenever it is needed.
> +


> +Updating the access time
> +------------------------
> +
> +We previously said of RCU-walk that it would "take no locks, increment
> +no counts, leave no footprints." We have since seen that some
> +"footprints" can be needed when handling symlinks as a counted
> +reference (or even a memory allocation) may be needed. But these
> +footprints are best kept to a minimum.
> +
> +One other place where walking down a symlink can involve leaving
> +footprints in a way that doesn't affect directories is in updating access times.
> +In Unix (and Linux) every filesystem object has a "last accessed
> +time", or "`atime`". Passing through a directory to access a file
> +within, is not considered to be an access for the purposes of

drop ^ comma?

> +`atime`; only listing the contents of a directory can update its `atime`.
> +Symlinks are different it seems. Both reading a symlink (with `readlink()`)
> +and looking up a symlink on the way to some other destination can
> +update the atime on that symlink.
> +



--
~Randy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/