[PATCH 17/39] union-mount: Union mounts documentation

From: Valerie Aurora
Date: Mon May 03 2010 - 19:19:53 EST


Document design and implementation of union mounts (a.k.a. writable
overlays).
---
Documentation/filesystems/union-mounts.txt | 899 ++++++++++++++++++++++++++++
1 files changed, 899 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..ba830e8
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,899 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - Examples
+ - VFS implementation
+ - Locking strategy
+ - VFS-fs interface
+ - Userland support
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one read-only
+file system, with all writes going to the writable file system. The
+namespace of both file systems appears as a combined whole to
+userland, with files and directories on the writable file system
+covering up any files or directories with matching pathnames on the
+read-only file system. The read-write file system is the "topmost"
+or "upper" file system and the read-only file system is the "lower"
+file system. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts). However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level. COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable file system is unnecessary synchronization overhead if no
+other client needs to see the writes.
+
+What union mounts are not
+-------------------------
+
+Union mounts are not a general-purpose unioning file system. They do
+not provide a generic "union of namespaces" operation for an arbitrary
+number of file systems. Many interesting features can be implemented
+with a generic unioning facility: unioning of more than two file
+systems, dynamic insertion and removal of branches, online upgrade,
+etc. Some unioning file systems that do this are UnionFS and AUFS.
+
+File systems can only be union mounted at their mountpoints, and the
+lower level file system cannot have any submounts.
+
+Terminology
+===========
+
+The main physical metaphor for union mounts is that a writable file
+system is mounted "on top" of a read-only file system. Lookups start
+at the "topmost" read-write file system and travel "down" to the
+"bottom" read-only file system only if no blocking entry exists on the
+top layer.
+
+Topmost layer: The read-write file system. Lookups begin here.
+
+Bottom layer: The read-only file system. Lookups end here.
+
+Path: Combination of the vfsmount and dentry structure.
+
+Follow down: Given a path from the top layer, find the corresponding
+path on the bottom layer.
+
+Follow up: Given a path from the bottom layer, find the corresponding
+path on the top layer.
+
+Whiteout: A directory entry in the top layer that prevents lookups
+from travelling down to the bottom layer. Created on unlink()/rmdir()
+if a corresponding directory entry exists in the bottom layer.
+
+Opaque flag: A flag on a directory in the top layer that prevents
+lookups of entries in this directory from travelling down to the
+bottom layer (unless there is an explicit fallthru entry allowing that
+for a particular entry). Set on creation of a directory that replaces
+a whiteout, and after a directory copyup.
+
+Fallthru: A directory entry which allows lookups to "fall through" to
+the bottom layer for that exact directory entry. This serves as a
+placeholder for directory entries from the bottom layer during
+readdir(). Fallthrus override opaque flags.
+
+File copyup: Create a file on the top layer that has the same metadata
+and contents as the file with the same pathname on the bottom layer.
+
+Directory copyup: Copy up the visible directory entries from the
+bottom layer as fallthrus in the matching top layer directory. Mark
+the directory opaque to avoid unnecessary negative lookups on the
+bottom layer.
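+
+The interaction of whiteouts, opaque flags, and fallthrus can be
+sketched as a small userland model. This is only an illustration of
+the rules defined above, not the kernel code: the types and the
+model_lookup() function are hypothetical names invented for this
+example, and each directory is reduced to a flat array of entries.
+
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical userland model; none of these names are kernel API. */
enum entry_kind { ENT_FILE, ENT_DIR, ENT_WHITEOUT, ENT_FALLTHRU };

struct entry {
	const char *name;
	enum entry_kind kind;
};

struct dir_model {
	const struct entry *entries;
	size_t nr_entries;
	int opaque;	/* only meaningful on the top layer */
};

enum found_in { FOUND_NONE, FOUND_TOP, FOUND_BOTTOM };

static const struct entry *find_entry(const struct dir_model *dir,
				      const char *name)
{
	for (size_t i = 0; i < dir->nr_entries; i++)
		if (strcmp(dir->entries[i].name, name) == 0)
			return &dir->entries[i];
	return NULL;
}

/* Resolve one name in a two-layer unioned directory. */
enum found_in model_lookup(const struct dir_model *top,
			   const struct dir_model *bottom,
			   const char *name)
{
	const struct entry *e;

	assert(top && bottom && name);

	e = find_entry(top, name);
	if (e) {
		if (e->kind == ENT_WHITEOUT)
			return FOUND_NONE;	/* lookup is blocked */
		if (e->kind != ENT_FALLTHRU)
			return FOUND_TOP;	/* top entry covers bottom */
		/* fallthru: continue to the bottom layer even if opaque */
	} else if (top->opaque) {
		return FOUND_NONE;	/* opaque directory, no fallthru */
	}

	return find_entry(bottom, name) ? FOUND_BOTTOM : FOUND_NONE;
}
```
+
+In this model a fallthru entry is the only way a name in an opaque
+top-layer directory can resolve to the bottom layer.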
+
+Examples
+========
+
+What happens when I...
+
+- creat() /newfile -> creates on topmost layer
+- unlink() /oldfile -> creates a whiteout on topmost layer
+- Edit /existingfile -> copies up to top layer at open(O_WRONLY/O_RDWR) time
+- truncate /existingfile -> copies up to topmost layer + N bytes if specified
+- touch()/chmod()/chown()/etc. -> copies up to topmost layer
+- mkdir() /newdir -> creates on topmost layer
+- rmdir() /olddir -> creates a whiteout on topmost layer
+- mkdir() /olddir after above -> creates on topmost layer w/ opaque flag
+- readdir() /shareddir -> copies up entries from bottom layer as fallthrus
+- link() /oldfile /newlink -> copies up /oldfile, creates /newlink on topmost layer
+- symlink() /oldfile /symlink -> nothing special
+- rename() /oldfile /newfile -> copies up /oldfile to /newfile on top layer
+- rename() /olddir /newdir -> EXDEV
+- rename() /topmost_only_dir /topmost_only_dir2 -> success
+
+Getting to a root file system with union mounts:
+
+- Mount the base read-only file system as the root file system
+- Mount the read-only file system again on /newroot
+- Mount the read-write layer on /newroot:
+ # mount -o union /dev/sda /newroot
+- pivot_root to /newroot
+- Start init
+
+See scripts/pivot.sh in the UML devkit linked to from:
+
+http://valerieaurora.org/union/
+
+VFS implementation
+==================
+
+Union mounts are implemented as an integral part of the VFS, rather
+than as a VFS client file system (i.e., a stacked file system like
+unionfs or ecryptfs). Implementing unioning inside the VFS eliminates
+the need for duplicate copies of VFS data structures, unnecessary
+indirection, and code duplication, but requires very maintainable,
+low-to-zero overhead code. Union mounts require no change to file
+systems serving as the read-only layer, and require some minor
+support from file systems serving as the read-write layer. File
+systems that want to be the writable layer must implement the new
+->whiteout() and ->fallthru() inode operations, which create special
+dummy directory entries.
+
+The union mounts code must accomplish the following major tasks:
+
+1) Pass lookups through to the lower level file system.
+2) Copy files and directories up to the topmost layer when written.
+3) Create whiteouts and fallthrus as necessary.
+
+VFS objects and union mounts
+----------------------------
+
+First, some VFS basics:
+
+The VFS allows multiple mounts of the same file system. For example,
+/dev/sda can be mounted at /usr and also at /mnt. The same file
+system can be mounted read-only at one point and read-write at
+another. Each of these mounts has its own vfsmount data structure in
+the kernel. However, each underlying file system has exactly one
+in-kernel superblock structure no matter how many times it is mounted.
+All the separate vfsmounts for the same file system reference the same
+superblock data structure.
+
+Directory entries are cached by the VFS in dentry structures. The VFS
+keeps one dentry structure for each file or directory in a file
+system, no matter how many times it is mounted. Each dentry
+represents only one element of a path name. When the VFS looks up a
+pathname (e.g., "/sbin/init"), the result is a combination of vfsmount
+and dentry. This <mnt,dentry> pair is usually stored in a kernel
+structure named "path", which is simply two pointers, one to the
+vfsmount and one to the dentry. A "struct path" is this structure; a
+pathname is a string like "/etc/fstab".
+
+As an example, given:
+
+/dev/sda mounted on /mnt
+/dev/sda mounted on /mnt2
+
+A pathname lookup for "/mnt/etc" will yield the pair:
+
+<vfsmount for /mnt, dentry for "etc" on /dev/sda>
+
+A pathname lookup for "/mnt2/etc" will yield the pair:
+
+<vfsmount for /mnt2, dentry for "etc" on /dev/sda>
+
+The dentry in both cases will be the exact same structure in memory.
+
+A union mount maps <mnt,dentry> pairs from the file system mounted on
+the "top" to <mnt,dentry> pairs from the file system on the "bottom."
+The same dentry can be a member of more than one union mount. For
+example, given:
+
+/dev/sdb union mounted on top of /dev/sda on /mnt/union1
+/dev/sdc union mounted on top of /dev/sda on /mnt/union2
+
+The dentry for the directory "etc/" on /dev/sda will be part of two union
+mount mappings:
+
+<vfsmount for /dev/sdb on /mnt/union1, dentry for "etc" on /dev/sdb>
+ |
+ v
+<vfsmount for /dev/sda on /mnt/union1, dentry for "etc" on /dev/sda>
+
+And:
+
+<vfsmount for /dev/sdc on /mnt/union2, dentry for "etc" on /dev/sdc>
+ |
+ v
+<vfsmount for /dev/sda on /mnt/union2, dentry for "etc" on /dev/sda>
+
+All of this is to say that we require a full <mnt,dentry> pair to
+accomplish any union mount tasks like copying a file to the topmost
+layer or looking up a directory entry in a lower layer. A dentry
+alone is not sufficient, since it can be part of several different
+union mounts.
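+
+A minimal sketch of why the pair matters, using simplified stand-ins
+for the kernel structures. The *_model types and path_model_equal()
+are assumptions made for this example; the real struct vfsmount and
+struct dentry carry far more state.
+
```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the kernel structures (assumption). */
struct vfsmount_model { const char *mountpoint; };
struct dentry_model { const char *name; };

/* Like "struct path": one pointer to the mount, one to the dentry. */
struct path_model {
	const struct vfsmount_model *mnt;
	const struct dentry_model *dentry;
};

/* Two paths are the same only if both the mount and the dentry match. */
bool path_model_equal(const struct path_model *a, const struct path_model *b)
{
	assert(a && b);
	return a->mnt == b->mnt && a->dentry == b->dentry;
}
```
+
+Two paths built from the same dentry but different mounts compare as
+different here, which is the distinction a bare dentry cannot express.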
+
+union_dir structure
+-------------------
+
+The first job of union mounts is to map directories from the topmost
+layer to directories with the same pathname in the lower layer. That
+is, we need to map the <mnt,dentry> pair for a given directory
+pathname in the topmost layer to the <mnt,dentry> pair for the
+directory with the same pathname in the lower layer. We do this with
+the union_dir structure:
+
+struct union_dir {
+ atomic_t u_count; /* reference count */
+ struct list_head u_unions; /* list head for d_unions */
+ struct list_head u_list; /* list head for mnt_unions */
+ struct hlist_node u_hash; /* list head for searching */
+ struct hlist_node u_rhash; /* list head for reverse searching */
+
+ struct path u_upper; /* this is me */
+ struct path u_lower; /* this is what I overlay */
+};
+
+This structure is flexible enough to support an arbitrary number of
+layers of unioned file systems, not just the current two-layer
+implementation. As such, this section will talk about mapping "upper"
+directories to "lower" directories, instead of "topmost" directories
+to "bottom" directories.
+
+At the time of a union mount, we allocate a union_dir structure to map
+the root directory of the upper layer to the root directory of the
+lower layer. In pseudo-code:
+
+u_upper = <upper mnt,dentry for "/">
+u_lower = <lower mnt,dentry for "/">
+
+This union_dir structure is then added to the union cache hash table,
+linked through u_hash, where it can be looked up via union_lookup()
+with the <upper mnt,dentry> pair as the key. A reverse lookup is also
+included (union_rlookup() using the <lower mnt,dentry> pair, linked
+through u_rhash) but is not currently used.
+
+The union_dir is also added to the list of union_dir structures that
+reference this dentry as the topmost dentry. This list is linked
+through the u_unions member in struct union_dir and the new d_unions
+member in struct dentry. The new d_union_lower_count member in struct
+dentry is a reference count showing how many unions reference this
+dentry through u_lower - that is, how many mounts this dentry is a
+lower dentry for.
+
+struct dentry {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+ /*
+ * Union mount structures that reference this dentry as the
+ * upper layer are linked through the d_unions field. If this
+ * list is not empty, then this dentry is part of a unioned
+ * directory stack. Protected by union_lock.
+ */
+ struct list_head d_unions;
+ /*
+ * Reference count of union_dirs with this dentry in the
+ * u_lower field of a union mount structure - that is, it is a
+ * dentry for a lower layer of a union. This count is NOT
+ * incremented for the dentry that is part of the topmost
+ * layer of a union.
+ */
+ unsigned int d_union_lower_count;
+#endif
+[...]
+};
+
+Each union_dir is also linked through the new mnt_unions member in the
+vfsmount structure of the upper mount:
+
+struct vfsmount {
+[...]
+#ifdef CONFIG_UNION_MOUNT
+ struct list_head mnt_unions; /* list of union_dir structures */
+#endif
+[...]
+};
+
+Traversing the union stack
+--------------------------
+
+The union_dir structures referring to a particular pathname are
+collectively called the union stack for that directory. (In the
+current code, only two layers and one union mount structure per path
+are allowed, but multiple layers are possible.) Note that in a union
+stack, none of the union_dir structures reference each other directly.
+Each union_dir struct records the relationship between two
+<mnt,dentry> pairs, the upper pair and the lower pair. If a third
+layer existed, you would traverse from the top layer to the second
+layer by calling union_lookup() on the top layer's <mnt,dentry> pair.
+This would return the union_dir struct with u_upper pointing to the
+top layer's <mnt,dentry>. Next you would take u_lower, which points
+to the second layer's <mnt,dentry> and call union_lookup() on that,
+which would return the union_dir mapping the second layer's
+<mnt,dentry> to the third layer's <mnt,dentry>.
+
+To traverse "down" the union stack one layer, use union_down_one().
+Currently, we never traverse the union stack "up" except as part of
+the normal VFS follow_mount() operation. follow_mount() is what lets
+us traverse from the directory serving as mountpoint to the root
+directory of the file system mounted at that mountpoint. Traversing
+the union stack "up" introduces lock ordering problems and generally
+complicates the code to the point of unmaintainability. Currently,
+union mounts perform all their tasks while traversing the union stack
+exactly once, going "down" in the union mounts terminology.
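+
+The downward traversal can be sketched in userland as follows. This
+is a model under stated assumptions, not the kernel implementation:
+integer ids stand in for the vfsmount and dentry pointers, a linear
+array stands in for the union cache hash table, and the model_*
+functions are hypothetical analogues of union_lookup() and
+union_down_one() whose real signatures are not shown in this document.
+
```c
#include <assert.h>
#include <stddef.h>

/* Integer ids stand in for the <mnt,dentry> pointers (assumption). */
struct upath { int mnt; int dentry; };

struct union_dir_model {
	struct upath u_upper;	/* this is me */
	struct upath u_lower;	/* this is what I overlay */
};

struct union_cache_model {
	const struct union_dir_model *dirs;
	size_t nr;
};

static int upath_same(struct upath a, struct upath b)
{
	return a.mnt == b.mnt && a.dentry == b.dentry;
}

/* Linear stand-in for the hashed lookup keyed on the upper pair. */
const struct union_dir_model *
model_union_lookup(const struct union_cache_model *c, struct upath upper)
{
	assert(c);
	for (size_t i = 0; i < c->nr; i++)
		if (upath_same(c->dirs[i].u_upper, upper))
			return &c->dirs[i];
	return NULL;
}

/* Step one layer down; returns 1 and updates *p if a lower layer exists. */
int model_union_down_one(const struct union_cache_model *c, struct upath *p)
{
	const struct union_dir_model *ud;

	assert(p);
	ud = model_union_lookup(c, *p);
	if (!ud)
		return 0;
	*p = ud->u_lower;
	return 1;
}

/* Count the layers reachable from the top path, including itself. */
size_t model_stack_depth(const struct union_cache_model *c, struct upath top)
{
	size_t depth = 1;

	while (model_union_down_one(c, &top))
		depth++;
	return depth;
}
```
+
+Note that the entries never point at each other; each step down is a
+fresh lookup keyed on the pair returned by the previous step.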
+
+Code paths
+----------
+
+Union mounts modify the following key code paths in the VFS:
+
+- mount()/umount()
+- Pathname lookup
+- Any path that modifies an existing file
+
+Mount
+-----
+
+Union mounts are created in two steps:
+
+1. Mount the bottom layer file system read-only in the usual manner.
+2. Mount the top layer with the "-o union" option at the same mountpoint.
+
+The bottom layer must be read-only and the top layer must be
+read-write and support whiteouts and fallthrus. A file system that
+supports whiteouts and fallthrus indicates this by setting the
+MS_WHITEOUT flag in the superblock. Currently, the top layer is
+forced to "noatime" to avoid a copyup on every access of a file.
+Supporting atime with the current infrastructure would require a
+copyup on every open(). The "relatime" option would be equally
+efficient if the atime is the same as or more recent than the mtime/ctime
+for every object on the read-only file system, and if the 24-hour
+timeout on relatime was disabled. However, this is probably not
+worthwhile for the majority of union mount use cases.
+
+The current step-by-step method of mounting union file systems won't
+work for three or more layers. Say you want to union mount three file
+systems on /mnt/union:
+
+/dev/bottom - read-only bottom layer
+/dev/middle - read-only middle layer
+/dev/topmost - read-write topmost layer
+
+First you mount the bottom layer read-only:
+
+mount -o ro /dev/bottom /mnt/union
+
+Then you want to mount the middle layer also read-only, but union
+mounts require that the top layer be read-write in order to support
+readdir() correctly:
+
+mount -o ro,union /dev/middle /mnt/union # WON'T WORK, fails
+
+The other approach is to mount the middle layer as read-write, but
+then the third mount of the topmost layer will fail because the
+underlying layer is not read-only:
+
+mount -o union /dev/middle /mnt/union
+mount -o union /dev/topmost /mnt/union # WON'T WORK, fails
+
+Two obvious options present themselves:
+
+1) Automatically attempt to convert the covered layer to read-only
+status. In this case, the mount of /dev/topmost would attempt to
+atomically remount /dev/middle as read-only during sys_mount(). If it
+succeeds, it would go on to mount /dev/topmost as read-write and
+unioned. This would actually be a usability improvement, since the
+administrator need not remember to mount the lower layers read-only.
+
+2) Execute the mount of all three layers in one system call by passing
+a mount option that is a string describing all the devices to be
+unioned together. This is ugly for obvious reasons: string parsing in
+the kernel, poor error granularity, need to unwind complicated state
+if the mount fails partway through the stack.
+
+The lower layer file system must not have any submounts - other file
+systems mounted at points in the lower file system's namespace. File
+systems can only be union mounted at their root directories. Without
+this restriction, some VFS operations must always do a union_lookup()
+- requiring a global lock - in order to find out if a path is
+potentially unioned. With this restriction, we can tell if a path is
+potentially unioned by checking a flag in the vfsmount.
+
+pivot_root() to a union mounted file system is supported. The
+recommended way to get to a union mounted root file system is to boot
+with the read-only mount as the root file system, construct the union
+mount on an entirely new mount, and pivot_root() to the new union
+mount root. Attempting to union mount the root file system later in
+boot will result in covering other file systems, e.g., /proc, which
+isn't permitted in the current code and is a bad idea anyway.
+
+Hard read-only file systems
+---------------------------
+
+Union mounts require the lower layer of the file system to be
+read-only. However, in Linux, any individual file system may be
+mounted at multiple places in the namespace, and a file system can be
+changed from read-only to read-write while still mounted. Thus, simply
+checking that the bottom layer is read-only at the time the writable
+overlay is mounted over it is pointless, since at any time the bottom
+layer may become read-write.
+
+We have to guarantee that a file system will be read-only for as long
+as it is the bottom layer of a union mount. To do this, we track the
+number of hard read-only users of a file system in its VFS superblock
+structure. When we union mount a writable overlay over a file system,
+we increment its read-only user count. The file system can only be
+mounted read-write if its read-only users count is zero.
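+
+A rough userland sketch of this counting scheme follows. The
+sb_model structure, its field names, the model_* functions, and the
+error codes returned are all assumptions made for illustration; they
+are not the names or values used in the patch set.
+
```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the relevant superblock state. */
struct sb_model {
	bool read_only;
	unsigned int hard_ro_users;	/* unions using this fs as bottom */
};

/* Called when a writable overlay is union mounted over this fs. */
int model_union_attach(struct sb_model *lower)
{
	assert(lower);
	if (!lower->read_only)
		return -EINVAL;	/* error code chosen for illustration */
	lower->hard_ro_users++;
	return 0;
}

/* Called when the writable overlay is unmounted again. */
void model_union_detach(struct sb_model *lower)
{
	assert(lower && lower->hard_ro_users > 0);
	lower->hard_ro_users--;
}

/* A read-write (re)mount is refused while any union depends on us. */
int model_remount_rw(struct sb_model *sb)
{
	assert(sb);
	if (sb->hard_ro_users > 0)
		return -EBUSY;	/* error code chosen for illustration */
	sb->read_only = false;
	return 0;
}
```
+
+Keeping the count in the superblock rather than the vfsmount matters
+because every mount of the file system shares one superblock.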
+
+Todo:
+
+- Support hard read-only NFS mounts. See discussion here:
+
+ http://markmail.org/message/3mkgnvo4pswxd7lp
+
+Pathname lookup
+---------------
+
+Pathname lookup in a unioned directory traverses down the union stack
+for the parent directory, looking up each pathname element in each
+layer of the file system (according to the rules of whiteouts,
+fallthrus, and opaque flags). At mount time, the union stack for the
+root directory of the file system is created, and the union stack
+creation for every other unioned directory in the file system is
+boot-strapped using the already-existing union stack of the
+directory's parent. In order to simplify the code greatly, every
+visible directory on the lower file system is required to have a
+matching directory on the upper file system. This matching directory
+is created during pathname lookup if it does not already exist.
+Therefore, each unioned directory is the child of another unioned
+directory (or is the root directory of the file system).
+
+As a high-level example, consider lookup of the lower layer file
+"/mnt/union/lower_subdir/lower_file" in the union of /dev/lower and
+/dev/upper, starting with the <mnt,dentry> pair for the root
+directory of the union mount.
+
+First, we look up "lower_subdir" in the parent directory, "/". Since
+this is the root directory for the mount, it already has a union stack
+constructed, consisting of one struct union_dir in the union hash
+table, filled out with:
+
+um->u_upper = <upper mnt,dentry for "/">
+um->u_lower = <lower mnt,dentry for "/">
+
+Using union_down_one(), we traverse the union stack for "/", looking
+up "lower_subdir" in the "/" directory for /dev/upper, and then in
+/dev/lower. "lower_subdir" only exists in the lower layer, so we
+create a matching directory in the upper layer, and then allocate and
+fill out a union_dir struct that maps these directories to each other:
+
+um->u_upper = <upper mnt,dentry for "lower_subdir">
+um->u_lower = <lower mnt,dentry for "lower_subdir">
+
+Now lookup proceeds with the <upper mnt,dentry> for "lower_subdir" and
+the pathname element "lower_file". We look up "lower_file" in the
+upper layer directory, finding no match. Since this is a unioned
+directory, we call union_down_one() on the <upper mnt,dentry for
+"lower_subdir">, which looks up the union_dir structure we just
+created and returns the <lower mnt,dentry> pair. We then look up
+"lower_file" in the lower layer directory, which succeeds. Unlike
+directories, files are not copied up at lookup time, so pathname
+lookup for "/mnt/union/lower_subdir/lower_file" is now complete with
+the final struct path of <lower mnt,dentry for "lower_file">.
+
+At a finer level of detail, the actual union lookup function is called
+in the following code paths:
+
+do_lookup()->do_union_lookup()->lookup_union()->__lookup_union()
+lookup_hash()->lookup_union()->__lookup_union()
+
+__lookup_union() is where the rules of whiteouts, fallthrus, and
+opaque flags are actually implemented. __lookup_union() returns
+either the first visible dentry, or a negative dentry from the topmost
+file system if no matching dentry exists. If it finds a directory, it
+looks up any potential matching lower layer directories. If it finds
+a lower layer directory, it calls append_to_union() on the pair of
+directories. append_to_union() looks up the upper path in the union
+cache and if no union cache entry already exists, it creates one.
+
+Note that not all directories in a union mount are unioned, only those
+with matching directories on the lower layer. The macro
+IS_UNIONED_DIR() is a cheap, constant time way to check if a directory
+is unioned, while IS_MNT_UNION() checks if the entire mount is unioned
+(and therefore whether the directory in question is potentially
+unioned).
+
+Currently, lookup of a negative dentry in a unioned directory requires
+a lookup in every directory in the union stack every time it is looked
+up. We could avoid subsequent lookups by adding a negative union
+cache entry, exactly the way negative dentries are cached.
+
+File copyup
+-----------
+
+Any system call that alters the data or metadata of a file on the
+bottom layer, or creates or changes a hard link to it, will trigger a
+copyup of the target file from the lower layer to the topmost layer:
+
+ - open(O_WRONLY | O_RDWR | O_APPEND | O_DIRECT)
+ - truncate()/open(O_TRUNC)
+ - link()
+ - rename()
+ - chmod()
+ - chown()/lchown()
+ - utimes()
+ - setxattr()/lsetxattr()
+
+Copyup of a file due to open(O_WRONLY/O_RDWR) has already occurred when:
+
+ - write()
+ - ftruncate()
+ - writable mmap()
+
+The following system calls will fail on an fd opened O_RDONLY:
+
+ - fchmod()
+ - fchown()
+ - fsetxattr()
+ - futimens()
+
+Contrary to common sense, the above system calls are defined to
+succeed on O_RDONLY fds. The idea seems to be that the
+O_RDONLY/O_RDWR/O_WRONLY flags only apply to the actual file data, not
+to any form of metadata (times, owner, mode, or even extended
+attributes). Applications making these system calls on O_RDONLY fds
+are correct according to the standard and work on non-union-mounts.
+They will need to be rewritten (O_RDONLY -> O_RDWR) to work on union
+mounts. We suspect this usage is uncommon.
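+
+For reference, the standard behaviour on an ordinary (non-union)
+file system can be observed with a short userland test. This is a
+sketch only: the function name and the scratch path under /tmp are
+choices made for this example.
+
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a scratch file, reopen it O_RDONLY, and change its mode through
 * the read-only descriptor. Returns the fchmod() result, or -1 if the
 * scratch file could not be set up.
 */
int fchmod_on_rdonly_fd(void)
{
	char tmpl[] = "/tmp/union-doc-XXXXXX";
	int fd, rofd, ret;

	fd = mkstemp(tmpl);
	if (fd < 0)
		return -1;
	close(fd);

	rofd = open(tmpl, O_RDONLY);
	if (rofd < 0) {
		unlink(tmpl);
		return -1;
	}

	ret = fchmod(rofd, 0644);	/* metadata change via read-only fd */

	close(rofd);
	unlink(tmpl);
	return ret;
}
```
+
+On an ordinary local file system this returns 0. An application
+relying on that result is the kind that would need to open the file
+O_RDWR instead to work on a union mount.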
+
+This deviation from standard is due to technical limitations of the
+union mount implementation. Specifically, we would need to replace an
+open file descriptor from the lower layer with an open file descriptor
+for a file with matching pathname and contents on the upper layer,
+which is difficult to do. We avoid this in other system calls by
+doing the copyup before the file is opened. Unionfs doesn't encounter
+this problem because it creates a dummy file struct which redirects or
+fans out operations to the struct files for the underlying file
+systems.
+
+From an application's point of view, the result of an in-kernel file
+copyup is the logical equivalent of another application updating the
+file via the rename() pattern: creat() a new file, copy the data over,
+make changes to the copy, and rename() over the old version. Any
+existing open file descriptors for that file (including those in the
+same application) refer to a now invisible object that used to have
+the same pathname. Only opens that occur after the copyup will see
+updates to the file.
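+
+The rename() pattern and its effect on already-open descriptors can
+be reproduced in userland on any ordinary file system. The sketch
+below is an analogy for what an application observes, not the
+in-kernel copyup code; the function names and the scratch directory
+under /tmp are assumptions of this example.
+
```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Write a short string to path, creating or truncating the file. */
static int write_file(const char *path, const char *data)
{
	size_t len = strlen(data);
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
	ssize_t n;

	if (fd < 0)
		return -1;
	n = write(fd, data, len);
	close(fd);
	return n == (ssize_t)len ? 0 : -1;
}

/* Read three bytes from fd and compare them with the string want. */
static int fd_reads(int fd, const char *want)
{
	char buf[4] = "";

	return read(fd, buf, 3) == 3 && strcmp(buf, want) == 0;
}

/*
 * Returns 1 if a descriptor opened before the replacement still reads
 * the old contents while a fresh open reads the new contents, 0 if
 * not, and -1 on setup failure.
 */
int old_fd_keeps_old_data(void)
{
	char dir[] = "/tmp/union-doc-XXXXXX";
	char path[64], tmp[64];
	int old_fd = -1, new_fd = -1, result = -1;

	if (!mkdtemp(dir))
		return -1;
	snprintf(path, sizeof(path), "%s/file", dir);
	snprintf(tmp, sizeof(tmp), "%s/file.new", dir);

	if (write_file(path, "old") != 0)
		goto out;
	old_fd = open(path, O_RDONLY);		/* opened before the update */
	if (old_fd < 0)
		goto out;
	if (write_file(tmp, "new") != 0 || rename(tmp, path) != 0)
		goto out;
	new_fd = open(path, O_RDONLY);		/* opened after the update */
	if (new_fd < 0)
		goto out;
	result = fd_reads(old_fd, "old") && fd_reads(new_fd, "new");
out:
	if (old_fd >= 0)
		close(old_fd);
	if (new_fd >= 0)
		close(new_fd);
	unlink(tmp);
	unlink(path);
	rmdir(dir);
	return result;
}
```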
+
+Permission checks
+-----------------
+
+We want to be sure we have the correct permissions to actually succeed
+in a system call before copying a file up to avoid unnecessary IO. At
+present, the permission check for a single system call may be spread
+out over many hundreds of lines of code (e.g., open()). In order to
+check permissions, we occasionally need to determine if there is a
+writable overlay on top of this inode. This requires a full path, but
+often we only have the inode at this point. In particular,
+inode_permission() returns EROFS if the inode is on a read-only file
+system, which is the wrong answer if there is a writable overlay
+mounted on top of it.
+
+Another trouble-maker is may_open(), which both checks permissions for
+open AND truncates the file if O_TRUNC is specified. It doesn't make
+any sense to copy up the file and then let may_open() truncate it, but
+we can't copy it after may_open() truncates it either. The current
+ugly hack is to pass the full nameidata to may_open() and copyup
+inside may_open().
+
+Some solutions:
+
+- Create __inode_permission() and pass it a flag telling it whether or
+ not to check for a read-only fs. Create union_permission() which
+ takes a path, checks for a union mount, and sets the rofs flag.
+ Place the file copyup call after all the permission checks are
+ completed. Push down the full path into the functions that need it
+ and currently only take the dentry or inode.
+
+- For each instance in which we might want to copyup, move permission
+ checks into a new function and call it from a level at which we
+ still have the full path. Pass it an "ignore read-only fs" flag if
+ the file is on a union mount. Pass around the ignore-rofs flag
+ inside the function doing permission checks. If all the permission
+ checks complete successfully, copyup the file. Would require moving
+ truncate out of may_open().
+
+Todo:
+ - On truncate, only copy up the N bytes of file data requested
+ - Make sure above handles truncate beyond EOF correctly
+ - File copyup on chown()/chmod()/chattr() etc.
+ - File copyup on open(O_APPEND)
+ - File copyup on open(O_DIRECT)
+
+Impact on non-union kernels and mounts
+--------------------------------------
+
+Union-related data structures, extra fields, and function calls are
+#ifdef'd out at the function/macro level with CONFIG_UNION_MOUNT in
+nearly all cases (see include/linux/union.h).
+
+Todo:
+
+ - Do performance tests
+
+Locking strategy
+================
+
+The current union mount locking strategy is based on the following
+rules:
+
+* Exactly two file systems are unioned
+* The bottom file system is always read-only
+* The top file system is always read-write
+ => A file system can never be a top and a bottom layer at the same time
+
+Additionally, the top layer may only be mounted exactly once. Don't
+think of the top layer as a separate independent file system; when it
+is part of a union mount, it is only a file system in conjunction with
+the read-only bottom layer. The read-only bottom layer is an
+independent file system in and of itself and can be mounted elsewhere,
+including as the bottom layer for another union mount.
+
+Thus, we may define a stable locking order in terms of top layer and
+bottom layer locks, since a top layer is never a bottom layer and a
+bottom layer is never a top layer. Another simplifying assumption is
+that all directories in a pathname exist on the top layer, as they are
+created step-by-step during lookup. This prevents us from ever having
+to walk backwards up the path creating directory entries, which can
+get complicated. By implication, parent directory paths during any
+operation (rename(), unlink(), etc.) are from the top layer. Dentries
+for directories from the bottom layer are only ever seen or used by
+the lookup code.
+
+The two major problems we avoid with the above rules are:
+
+Lock ordering: Imagine two union stacks with the same two file
+systems: A mounted over B, and B mounted over A. Sometimes locks on
+objects in both A and B will have to be held simultaneously. What
+order should they be acquired in? Simply acquiring them from top to
+bottom will create a lock-ordering problem - one thread acquires a lock
+on an object from A and then tries for a lock on an object from B, while
+another thread grabs the lock on the object from B and then waits for
+the lock on the object from A. Some other lock ordering must be defined.
+
+Movement/change/disappearance of objects on multiple layers: A variety
+of nasty corner cases arise when more than one layer is changing at
+the same time. Changes in the directory topology and their effect on
+inheritance are of special concern. Al Viro's canonical email on the
+subject:
+
+http://lkml.indiana.edu/hypermail/linux/kernel/0802.0/0839.html
+
+We don't try to solve any of these cases, just avoid them in the first
+place.
+
+Todo: Prevent top layer from being mounted more than once.
+
+Cross-layer interactions
+------------------------
+
+The VFS code simultaneously holds references to and/or modifies
+objects from both the top and bottom layers in the following cases:
+
+Path lookup:
+
+Grabs i_mutex on bottom layer while holding i_mutex on top layer
+directory inode.
+
+File copyup:
+
+Holds i_mutex on the parent directory from the top layer while copying
+up file from lower layer.
+
+link():
+
+File copyup of target while holding i_mutex on parent directory on top
+layer. Followed by a normal link() operation.
+
+rename():
+
+Holds s_vfs_rename_mutex on the top layer, i_mutex of the source's
+parent dir (top layer), and i_mutex of the target's parent dir (also
+top layer) while looking up and copying the bottom layer target and
+also creating the whiteout.
+
+Notes on rename():
+
+First, renaming of directories returns EXDEV. It's not at all
+reasonable to recursively copy directory trees and userspace has to
+handle this case anyway. An exception is rename() of directories that
+exist only on the topmost layer; this succeeds.
+
+Rename involves three steps on a union mount: (1) copyup of the file
+from the bottom layer, (2) rename of the new top-layer copy to the
+target in the usual manner, (3) creation of a whiteout covering the
+source of the rename.
+
+Directory copyup:
+
+Directory entries are copied up on the first readdir(). We hold the
+top layer directory i_mutex throughout and sequentially acquire and
+drop the i_mutex for each lower layer directory.
+
+VFS-fs interface
+================
+
+Read-only layer: No support necessary other than enforcement of really
+really read-only semantics (done by VFS for local file systems).
+
+Writable layer: Must implement two new inode operations:
+
+int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
+int (*fallthru) (struct inode *, struct dentry *);
+
+And set the MS_WHITEOUT flag to indicate support of these operations.
+
+Todo:
+
+- Decide what to return in d_ino of struct dirent
+ - As Miklos Szeredi points out, the inode number from the underlying
+ fs is from a different inode "namespace" and doesn't have any
+ useful meaning in the top layer fs.
+- Implement whiteouts and fallthrus in ext3
+- Implement whiteouts and fallthrus in btrfs
+
+Supported file systems
+----------------------
+
+Any file system can be a read-only layer. File systems must
+explicitly support whiteouts and fallthrus in order to be a read-write
+layer. This patch set implements whiteouts for ext2, tmpfs, and
+jffs2. We have tested ext2, tmpfs, and iso9660 as the read-only
+layer.
+
+Todo:
+ - Test corner cases of case-insensitive/oversensitive file systems
+
+NFS interaction
+===============
+
+NFS is currently not supported as either type of layer. NFS as a
+read-only layer requires support from the server to honor the
+read-only guarantee needed for the bottom layer. To do this, the
+server needs to revoke access from clients that requested read-only
+file systems if the exported file system is remounted read-write or
+unmounted (during which arbitrary changes can occur). Some recent
+discussion:
+
+http://markmail.org/message/3mkgnvo4pswxd7lp
+
+NFS as the read-write layer would require implementation of the
+->whiteout() and ->fallthru() methods. DT_WHT directory entries are
+theoretically already supported.
+
+Also, technically the requirement for a readdir() cookie that is
+stable across reboots comes only from file systems exported via NFSv2:
+
+http://oss.oracle.com/pipermail/btrfs-devel/2008-January/000463.html
+
+Todo:
+
+- Guarantee really really read-only on NFS exports
+- Implement whiteout()/fallthru() for NFS
+
+Userland support
+================
+
+The mount command must support the "-o union" mount option and pass
+the corresponding MS_UNION flag to the kernel. A util-linux git
+tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
+
+File system utilities must support whiteouts and fallthrus. An
+e2fsprogs git tree with union mount support is here:
+
+git://git.kernel.org/pub/scm/fs/ext2/val/e2fsprogs.git
+
+Currently, whiteout directory entries are not returned to userland.
+While the directory type for whiteouts, DT_WHT, has been defined for
+many years, very little userland code handles them. Userland will
+never see fallthru directory entries.
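+
+For userland code that does want to be prepared for DT_WHT, a
+minimal sketch follows. It assumes a libc whose <dirent.h> exposes
+d_type and DT_WHT (glibc does under _DEFAULT_SOURCE);
+count_whiteouts() is a hypothetical helper name. Because whiteouts
+are currently not returned to userland, it is expected to find none
+on a union mount today.
+
```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <dirent.h>
#include <stddef.h>

/*
 * Count directory entries in 'path' that readdir() reports as
 * whiteouts (d_type == DT_WHT). Returns -1 if the directory cannot
 * be opened.
 */
static int count_whiteouts(const char *path)
{
	DIR *dir;
	struct dirent *de;
	int n = 0;

	assert(path != NULL);
	dir = opendir(path);
	if (!dir)
		return -1;
	while ((de = readdir(dir)) != NULL) {
		/* Entries of any other type, including DT_UNKNOWN, are skipped. */
		if (de->d_type == DT_WHT)
			n++;
	}
	closedir(dir);
	return n;
}
```
+
+A directory lister could use the same d_type test to skip whiteout
+entries instead of counting them.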
+
+Known non-POSIX behaviors
+-------------------------
+
+- Any writing system call (unlink()/chmod()/etc.) can return ENOSPC or EIO
+- Link count may be wrong for bottom layer files with a link count > 1
+- Link count on directories will be wrong before readdir() (fixable)
+- File copyup is the logical equivalent of an update via copy +
+ rename(). Any existing open file descriptors will continue to refer
+ to the read-only copy on the bottom layer and will not see any
+ changes that occur after the copy-up.
+- rename() of directory fails with EXDEV
+- inode number in d_ino of struct dirent will be wrong for fallthrus
+- fchmod()/fchown()/futimens()/fsetattr() fail on O_RDONLY fds
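+
+The "copy + rename()" comparison in the list above can be reproduced
+on any ordinary file system, without a union mount. The sketch below
+uses only standard POSIX calls; write_file() and
+old_fd_sees_old_data() are hypothetical helper names, and both paths
+are assumed to be on the same file system so that rename() works.
+
```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Create or truncate 'path' and write 'text' to it. Returns 0 or -1. */
static int write_file(const char *path, const char *text)
{
	size_t len = strlen(text);
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	if (write(fd, text, len) != (ssize_t)len) {
		close(fd);
		return -1;
	}
	return close(fd);
}

/*
 * Open 'path' read-only, then replace it by writing a new copy to
 * 'tmp_path' and renaming that copy over 'path'. Returns 1 if the
 * descriptor opened earlier still reads the old contents, 0 if it
 * does not, and -1 on error. Both files are removed afterwards.
 */
static int old_fd_sees_old_data(const char *path, const char *tmp_path)
{
	char buf[16] = "";
	int fd, ret = -1;

	assert(path != NULL && tmp_path != NULL);
	if (write_file(path, "old") != 0)
		goto out;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		goto out;
	/* Update via copy + rename(): the name now refers to a new inode. */
	if (write_file(tmp_path, "new") != 0 || rename(tmp_path, path) != 0)
		goto out_close;
	if (read(fd, buf, sizeof(buf) - 1) < 0)
		goto out_close;
	ret = strcmp(buf, "old") == 0;
out_close:
	close(fd);
out:
	unlink(tmp_path);
	unlink(path);
	return ret;
}
```
+
+The descriptor opened before the rename() keeps referring to the
+original inode, which is the same effect an application sees when a
+file it already has open is copied up.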
+
+Status
+======
+
+The current union mounts implementation is feature-complete on local
+file systems and passes an extensive union mounts test suite,
+available in the union mounts Usermode Linux-based development kit:
+
+http://valerieaurora.org/union/union_mount_devkit.tar.gz
+
+The whiteout code has had some non-trivial level of review and
+testing, but the majority of the rest of the code has had no external
+review or testing outside the authors' machines.
+
+The latest version is available at:
+
+git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
+
+Check the union mounts web page for the name of the latest branch:
+
+http://valerieaurora.org/union/
+
+Todo:
+
+- Run more tests (e.g., XFS test suite)
+- Get review from VFS maintainers
+
+Non-features
+------------
+
+Features we do not currently plan to support in union mounts:
+
+Online upgrade: E.g., installing software on a file system NFS
+exported to clients while the clients are still up and running.
+Allowing the read-only bottom layer of a union mount to change
+invalidates our locking strategy.
+
+Recursive copying of directories: E.g., implementing rename() across
+layers for directories. Doing an in-kernel copy of a single file is
+bad enough. Recursively copying a directory is a big no-no.
+
+Read-only top layer: The readdir() strategy fundamentally requires the
+ability to create persistent directory entries on the top layer file
+system (which may be tmpfs). Numerous alternatives (including
+in-kernel or in-application caching) exist and are compatible with
+union mounts with its writing-readdir() implementation disabled.
+Creating a readdir() cookie that is stable across multiple readdir()s
+requires one of:
+
+- Write to stable storage (e.g., fallthru dentries)
+- Non-evictable kernel memory cache (doesn't handle NFS server reboot)
+- Per-application caching by glibc readdir()
+
+Aggregation of multiple read-only file systems: We are beginning to
+see how to implement this but it doesn't currently work.
+
+Often these features are supported by other unioning file systems or
+by other versions of union mounts.
+
+Contributing to union mounts
+============================
+
+The union mounts web page is here:
+
+http://valerieaurora.org/union/
+
+It links to:
+
+ - All git repositories
+ - Documentation
+ - An entire self-contained UML-based dev kit with README, etc.
+
+The best mailing list for discussing union mounts is:
+
+linux-fsdevel@xxxxxxxxxxxxxxx
+
+http://vger.kernel.org/vger-lists.html#linux-fsdevel
+
+Thank you for reading!
--
1.6.3.3
