[RFC PATCH 10/10] vfs: shared subtree documentation

From: Ram
Date: Fri Sep 16 2005 - 13:27:45 EST



Complete description of shared subtrees.

Signed by Ram Pai (linuxram@xxxxxxxxxx)

Documentation/sharedsubtree.txt | 1015 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 1015 insertions(+)

Index: 2.6.13.sharedsubtree/Documentation/sharedsubtree.txt
===================================================================
--- /dev/null
+++ 2.6.13.sharedsubtree/Documentation/sharedsubtree.txt
@@ -0,0 +1,1015 @@
+Shared Subtrees
+---------------
+
+Contents:
+
+ 1) Overview
+ 2) Features
+ 3) smount command
+ 4) Use cases
+ 5) Detailed semantics
+ 6) Quiz
+ 7) FAQ
+ 8) Bugs
+ 9) Implementation
+
+
+1) Overview
+-----------
+
+Consider the following situation:
+
+A process wants to clone its own namespace, but still wants to access the CD
+that got mounted recently. Shared subtree semantics provide the necessary
+mechanism to accomplish the above.
+
+Shared subtrees also provide the building blocks for features such as
+per-user namespaces and versioned filesystems.
+
+2) Features
+-----------
+
+Shared subtrees provide four different flavors of mounts (struct vfsmount, to
+be precise):
+
+ a. shared mount
+ b. slave mount
+ c. private mount
+ d. unclonable mount
+
+
+2a) A shared mount can be replicated to as many mountpoints as needed, and
+all the replicas remain exactly the same.
+
+ Here is an example:
+
+ Let's say /mnt has a mount that is shared.
+ #mount --make-shared /mnt
+
+ Note: the mount command does not yet support the --make-shared flag.
+ A small C program that does the same is included in section 3; it is
+ run as 'smount /mnt shared'.
+
+ #mount --bind /mnt /tmp
+ The above command replicates the mount at /mnt to the mountpoint /tmp
+ and the contents of both the mounts remain identical.
+
+ #ls /mnt
+ a b c
+
+ #ls /tmp
+ a b c
+
+
+ Now let's say we mount a device at /tmp/a
+ #mount /dev/sd0 /tmp/a
+
+ #ls /tmp/a
+ t1 t2 t3
+
+ #ls /mnt/a
+ t1 t2 t3
+
+ Note that the mount has propagated to the mount at /mnt as well.
+
+ And the same is true even when /dev/sd0 is mounted on /mnt/a. The
+ contents will be visible under /tmp/a too.
+
+
+2b) A slave mount is like a shared mount except that mount and umount events
+ only propagate towards it.
+
+ All slave mounts have a master mount which is a shared mount.
+
+ Here is an example:
+
+ Let's say /mnt has a mount that is shared.
+ #mount --make-shared /mnt
+
+ Let's bind mount /mnt to /tmp
+ #mount --bind /mnt /tmp
+
+ The new mount at /tmp is also a shared mount and it is a replica of
+ the mount at /mnt.
+
+ Now let's make the mount at /tmp a slave of /mnt
+ #mount --make-slave /tmp
+ [or smount /tmp slave]
+
+ Let's mount /dev/sd0 on /mnt/a
+ #mount /dev/sd0 /mnt/a
+
+ #ls /mnt/a
+ t1 t2 t3
+
+ #ls /tmp/a
+ t1 t2 t3
+
+ Note that the mount event has propagated to the mount at /tmp.
+
+ However, let's see what happens if we mount something on the mount at /tmp
+
+ #mount /dev/sd1 /tmp/b
+
+ #ls /tmp/b
+ s1 s2 s3
+
+ #ls /mnt/b
+
+ Note how the mount event has not propagated to the mount at
+ /mnt.
+
+
+2c) A private mount does not forward or receive propagation.
+
+ This is the mount we are familiar with. It's the default type.
+
+
+2d) An unclonable mount is a private mount that cannot be replicated.
+
+ Let's say we have a mount at /mnt
+ and we make it unclonable
+
+ #mount --make-unclonable /mnt
+ [ smount /mnt unclonable ]
+
+ Let's try to bind mount this mount somewhere else.
+ # mount --bind /mnt /tmp
+ mount: wrong fs type, bad option, bad superblock on /mnt,
+ or too many mounted file systems
+
+ Replicating an unclonable mount is an invalid operation.
+
+
+3) smount command
+
+ Currently the mount command is not aware of shared subtree features.
+ Work is in progress to add support to mount (util-linux package).
+ Until then, use the following program.
+
+ ------------------------------------------------------------------------
+ //
+ //this code was developed by Miklos Szeredi <miklos@xxxxxxxxxx>
+ //and modified by Ram Pai <linuxram@xxxxxxxxxx>
+ // sample usage:
+ // smount /tmp shared
+ //
+ #include <stdio.h>
+ #include <stdlib.h>
+ #include <string.h> /* strcmp() */
+ #include <unistd.h>
+ #include <sys/mount.h>
+ #include <sys/fsuid.h>
+
+ #ifndef MS_REC
+ #define MS_REC 0x4000 /* 16384: Recursive loopback */
+ #endif
+
+ #ifndef MS_SHARED
+ #define MS_SHARED (1<<20) /* Shared */
+ #endif
+
+ #ifndef MS_PRIVATE
+ #define MS_PRIVATE (1<<18) /* Private */
+ #endif
+
+ #ifndef MS_SLAVE
+ #define MS_SLAVE (1<<19) /* Slave */
+ #endif
+
+ #ifndef MS_UNCLONE
+ #define MS_UNCLONE (1<<17) /* UNCLONE */
+ #endif
+
+ int main(int argc, char *argv[])
+ {
+ int type;
+ if(argc != 3) {
+ fprintf(stderr, "usage: %s dir "
+ "<rshared|rslave|rprivate|runclonable|shared|slave"
+ "|private|unclonable>\n" , argv[0]);
+ return 1;
+ }
+
+ fprintf(stdout, "%s %s %s\n", argv[0], argv[1], argv[2]);
+
+ if (strcmp(argv[2],"rshared")==0)
+ type=(MS_SHARED|MS_REC);
+ else if (strcmp(argv[2],"rslave")==0)
+ type=(MS_SLAVE|MS_REC);
+ else if (strcmp(argv[2],"rprivate")==0)
+ type=(MS_PRIVATE|MS_REC);
+ else if (strcmp(argv[2],"runclonable")==0)
+ type=(MS_UNCLONE|MS_REC);
+ else if (strcmp(argv[2],"shared")==0)
+ type=MS_SHARED;
+ else if (strcmp(argv[2],"slave")==0)
+ type=MS_SLAVE;
+ else if (strcmp(argv[2],"private")==0)
+ type=MS_PRIVATE;
+ else if (strcmp(argv[2],"unclonable")==0)
+ type=MS_UNCLONE;
+ else {
+ fprintf(stderr, "invalid operation: %s\n", argv[2]);
+ return 1;
+ }
+ setfsuid(getuid());
+ if(mount("", argv[1], "ext2", type, "") == -1) {
+ perror("mount");
+ return 1;
+ }
+ return 0;
+ }
+ -----------------------------------------------------------------------
+
+ Copy the above code snippet into smount.c
+ gcc -o smount smount.c
+
+
+ (i) To mark all the mounts under /mnt as shared, execute the following
+ command:
+
+ smount /mnt rshared
+ The corresponding syntax planned for the mount command is
+ mount --make-rshared /mnt
+
+ To mark only the mount at /mnt as shared, execute the following
+ command:
+ smount /mnt shared
+ The corresponding syntax planned for the mount command is
+ mount --make-shared /mnt
+
+ (ii) To mark all the shared mounts under /mnt as slave, execute the
+ following command:
+
+ smount /mnt rslave
+ The corresponding syntax planned for the mount command is
+ mount --make-rslave /mnt
+
+ To mark only the mount at /mnt as slave, execute the following
+ command:
+ smount /mnt slave
+ The corresponding syntax planned for the mount command is
+ mount --make-slave /mnt
+
+ (iii) To mark all the mounts under /mnt as private, execute the
+ following command:
+
+ smount /mnt rprivate
+ The corresponding syntax planned for the mount command is
+ mount --make-rprivate /mnt
+
+ To mark only the mount at /mnt as private, execute the following
+ command:
+ smount /mnt private
+ The corresponding syntax planned for the mount command is
+ mount --make-private /mnt
+
+ NOTE: by default all mounts are created as private. But if
+ you want to change a shared/slave/unclonable mount to
+ private at a later point in time, this command can help.
+
+ (iv) To mark all the mounts under /mnt as unclonable, execute the
+ following command:
+
+ smount /mnt runclonable
+ The corresponding syntax planned for the mount command is
+ mount --make-runclonable /mnt
+
+ To mark only the mount at /mnt as unclonable, execute the following
+ command:
+ smount /mnt unclonable
+ The corresponding syntax planned for the mount command is
+ mount --make-unclonable /mnt
+
+
+
+4) Use cases
+------------
+
+ A) A process wants to clone its own namespace, but still wants to
+ access the CD that got mounted recently.
+
+ Solution:
+
+ The system administrator can make the mount at /cdrom shared
+ mount --bind /cdrom /cdrom
+ mount --make-shared /cdrom
+
+ Now any process that clones off a new namespace will have a mount
+ at /cdrom which is a replica of the same mount in the parent namespace.
+
+ So when a CD is inserted and mounted at /cdrom, that mount gets
+ propagated to the other mounts at /cdrom in all the cloned
+ namespaces.
+
+ B) A process wants its mounts invisible to any other process, but
+ still be able to see the other system mounts.
+
+ Solution:
+
+ To begin with, the administrator can mark the entire mount tree
+ as shareable.
+
+ mount --make-rshared /
+
+ A new process can clone off a new namespace and mark some part of
+ its namespace as slave
+
+ mount --make-rslave /myprivatetree
+
+ Henceforth, any mounts within /myprivatetree done by the process
+ will not show up in any other namespace. However, mounts done in the
+ parent namespace under /myprivatetree still show up in the
+ process's namespace.
+
+
+ Apart from the above semantics, this feature provides the building
+ blocks to solve the following problems:
+
+ C) Per-user namespace
+ The above semantics allow mounts to be shared across namespaces.
+ But namespaces are associated with processes. If namespaces are made
+ first class objects with a user API to associate/disassociate a
+ namespace with a userid, then each user could have his/her own
+ namespace and tailor it to his/her requirements. Of course, this needs
+ support from PAM.
+
+ D) Versioned files
+
+ If the entire mount tree is visible at multiple locations, then an
+ underlying versioning file system can return different versions of the
+ file depending on the path used to access that file.
+
+ An example is:
+
+ mount --make-shared /
+ mount --rbind / /view/v1
+ mount --rbind / /view/v2
+ mount --rbind / /view/v3
+ mount --rbind / /view/v4
+
+ and if at /usr there is a versioning filesystem mounted, that mount
+ appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and /view/v4/usr
+ too.
+
+ A user can request the v3 version of the file /usr/fs/namespace.c by
+ accessing /view/v3/usr/fs/namespace.c. The underlying versioning
+ filesystem can then decipher that the v3 version of the filesystem is
+ being requested and return the corresponding inode.
+
+ E) Information leakage
+
+ Many programs leave garbage in /tmp and other directories which
+ other users on the system can observe (covert channels?).
+
+ The way to solve this is to have a per-process namespace and an
+ unclonable mount at /tmp in each namespace.
+
+
+
+5) Detailed semantics
+---------------------
+ The section below explains the detailed semantics of the
+ bind, rbind, move, mount, umount and clone-namespace operations.
+
+5A) Bind semantics
+
+ Consider the following command
+
+ mount --bind A/a B/b
+
+ where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B'
+ is the destination mount and 'b' is the dentry in the destination mount.
+
+ The outcome depends on the type of mount of 'A' and 'B'. The table
+ below is a quick reference.
+ --------------------------------------------------------------------
+ |                       BIND MOUNT OPERATION                       |
+ |******************************************************************|
+ |dest(B)-->|   shared    |   private   |    slave    | unclonable  |
+ | source(A)|             |             |             |             |
+ |    |     |             |             |             |             |
+ |    v     |             |             |             |             |
+ |******************************************************************|
+ |          |             |             |             |             |
+ | shared   |   shared    |   shared    |   shared    |   shared    |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ | private  |   shared    |   private   |   private   |   private   |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ | slave    |   shared    |    slave    |    slave    |    slave    |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ |unclonable|   invalid   |   invalid   |   invalid   |   invalid   |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ ********************************************************************
+
+ Details follow:
+
+ 1. 'A' is a private mount and 'B' is a private mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'.
+
+ 2. 'A' is a shared mount and 'B' is a private mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also 'C' is set for propagation
+ with 'A'. In other words 'A' and 'C' propagate to each other.
+
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a private mount. A new
+ mount 'C' which is a clone of 'A' is created. Its root dentry is 'a'.
+ 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a slave
+ mount of 'Z'. In other words 'A' and 'C' are both slave mounts of 'Z'.
+ All mount/unmount events on 'Z' propagate to 'A' and 'C'. But
+ mount/unmount on 'A' does not propagate anywhere else. Similarly
+ mount/unmount on 'C' does not propagate anywhere else.
+
+ 4. 'A' is an unclonable mount and 'B' is a private mount. This is an
+ invalid operation. An unclonable mount cannot be bind mounted.
+
+ 5. 'A' is a private mount and 'B' is a shared mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree is set up containing all the new
+ mounts 'C1', .., 'Cn' with exactly the same configuration as the
+ propagation tree of 'B'.
+
+ 6. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', 'C3' ...
+ are created and mounted at the dentry 'b' on all mounts where 'B'
+ propagates to. A new propagation tree is set up for all the new mounts
+ 'C1', .., 'Cn' with exactly the same configuration as the propagation
+ tree of 'B'. And finally the mounts 'C' and 'A' are set to propagate to
+ each other.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new
+ mount 'C' which is a clone of 'A' is created. Its root dentry is 'a'.
+ 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2',
+ 'C3' ... are created and mounted at the dentry 'b' on all mounts where
+ 'B' propagates to. A new propagation tree is set up for all the new
+ mounts 'C1', .., 'Cn' with exactly the same configuration as the
+ propagation tree of 'B'. And finally the mount 'C' is made the slave of
+ mount 'Z'.
+
+ 8. 'A' is an unclonable mount and 'B' is a shared mount. This is an
+ invalid operation.
+
+ 9. 'A' is a private mount and 'B' is a slave mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'.
+
+ 10. 'A' is a shared mount and 'B' is a slave mount. A new mount 'C'
+ which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. And finally the mounts 'C' and 'A'
+ are set to propagate to each other.
+
+ 11. 'A' is a slave mount of mount 'Z' and 'B' is a slave mount. A new
+ mount 'C' which is a clone of 'A' is created. Its root dentry is 'a'. 'C'
+ is mounted on mount 'B' at dentry 'b'. And finally the mount 'C' is
+ made the slave of mount 'Z'.
+
+ 12. 'A' is an unclonable mount and 'B' is a slave mount. This is an
+ invalid operation.
+
+ 13. 'A' is a private mount and 'B' is an unclonable mount. A new mount
+ 'C' which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'.
+
+ 14. 'A' is a shared mount and 'B' is an unclonable mount. A new mount
+ 'C' which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is
+ mounted on mount 'B' at dentry 'b'. Also 'C' is set for propagation
+ with 'A'. In other words 'A' and 'C' are propagation peers of each
+ other.
+
+ 15. 'A' is a slave mount of mount 'Z' and 'B' is an unclonable mount. A
+ new mount 'C' which is a clone of 'A' is created. Its root dentry is
+ 'a'. This mount is mounted on mount 'B' at dentry 'b'. Also 'C' is set
+ as a slave mount of 'Z'.
+
+ 16. 'A' is an unclonable mount and 'B' is an unclonable mount. This is
+ an invalid operation.
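+
+ The bind table can also be read as a small lookup function. The
+ following is a minimal sketch in C that transcribes the table only. The
+ enum and function names are illustrative assumptions, not kernel
+ identifiers, and the creation of the propagated copies 'C1' .. 'Cn' is
+ not modeled.
+
```c
#include <assert.h>

/* Propagation types as named in this document (names are illustrative). */
enum ptype { P_SHARED, P_PRIVATE, P_SLAVE, P_UNCLONABLE, P_INVALID };

/*
 * Type of the new mount 'C' created by "mount --bind A/a B/b",
 * transcribed from the BIND MOUNT OPERATION table.
 * 'src' is the type of 'A', 'dest' is the type of 'B'.
 */
static enum ptype bind_result(enum ptype src, enum ptype dest)
{
    assert(src <= P_UNCLONABLE && dest <= P_UNCLONABLE);

    if (src == P_UNCLONABLE)
        return P_INVALID;   /* an unclonable mount cannot be bind mounted */
    if (src == P_SHARED || dest == P_SHARED)
        return P_SHARED;    /* first row and first column of the table */
    return src;             /* private stays private, slave stays slave */
}
```
+
+ For example, bind_result(P_SLAVE, P_PRIVATE) yields P_SLAVE (case 3
+ above) and bind_result(P_PRIVATE, P_SHARED) yields P_SHARED (case 5).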
+
+
+5B) Rbind semantics
+ Rbind is the same as bind, except that bind replicates only the
+ specified mount, whereas rbind replicates all the mounts in the tree
+ belonging to the specified mount. An rbind mount is a bind mount
+ applied to all the mounts in the tree.
+
+ If the source tree that is rbound has some unclonable mounts,
+ then the subtree under the unclonable mount is pruned in the new
+ location.
+
+ e.g.: let's say we have the following mount tree.
+
+         A
+       /   \
+      B     C
+     / \   / \
+    D   E F   G
+
+ Let's say that 'C' is unclonable and that all the other mounts in the
+ tree are something other than unclonable.
+
+ If this tree is rbound to, say, Z
+
+ we will have the following tree at the new location.
+
+         Z
+         |
+         A'
+        /
+       B'       Note how the tree under C is pruned
+      / \       in the new location.
+     D'  E'
+
+
+5C) Move semantics
+
+ Consider the following command
+
+ mount --move A B/b
+
+ where 'A' is the source mount, 'B' is the destination mount and 'b' is
+ the dentry in the destination mount.
+
+ The outcome depends on the type of the mount of 'A' and 'B'. The table
+ below is a quick reference.
+ --------------------------------------------------------------------
+ |                       MOVE MOUNT OPERATION                       |
+ |******************************************************************|
+ |dest(B)-->|   shared    |   private   |    slave    | unclonable  |
+ | source(A)|             |             |             |             |
+ |    |     |             |             |             |             |
+ |    v     |             |             |             |             |
+ |******************************************************************|
+ |          |             |             |             |             |
+ | shared   |   shared    |   shared    |   shared    |   shared    |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ | private  |   shared    |   private   |   private   |   private   |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ | slave    |   shared    |    slave    |    slave    |    slave    |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ |unclonable|   invalid   | unclonable  | unclonable  | unclonable  |
+ |          |             |             |             |             |
+ |          |             |             |             |             |
+ ********************************************************************
+ NOTE: moving a mount residing under a shared mount is invalid.
+
+ Details follow:
+
+ 1. 'A' is a private mount and 'B' is a private mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'.
+
+ 2. 'A' is a shared mount and 'B' is a private mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a shared
+ mount.
+
+ 3. 'A' is a slave mount of mount 'Z' and 'B' is a private mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues
+ to be a slave mount of mount 'Z'.
+
+ 4. 'A' is an unclonable mount and 'B' is a private mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
+ unclonable mount.
+
+ 5. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'... 'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. The mount 'A' becomes a shared mount and a
+ propagation tree is created in the exact same configuration as that of
+ 'B'. This new propagation tree contains all the new mounts 'A1',
+ 'A2'... 'An'.
+
+ 6. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An'
+ are created and mounted at dentry 'b' on all mounts that receive
+ propagation from mount 'B'. A new propagation tree is created in the
+ exact same configuration as that of 'B'. This new propagation tree
+ contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree
+ of 'A'.
+
+ 7. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1',
+ 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that
+ receive propagation from mount 'B'. A new propagation tree is created
+ in the exact same configuration as that of 'B'. This new propagation
+ tree contains all the new mounts 'A1', 'A2'... 'An'. And this new
+ propagation tree is appended to the already existing propagation tree of
+ 'A'. Mount 'A' continues to be the slave mount of 'Z'.
+
+ 8. 'A' is an unclonable mount and 'B' is a shared mount. The operation
+ is invalid, because mounting anything on the shared mount 'B' can
+ create new mounts that get mounted on the mounts that receive
+ propagation from 'B'. And since the mount 'A' is unclonable, cloning
+ it to mount at other mountpoints is not possible.
+
+ 9. 'A' is a private mount and 'B' is a slave mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a private
+ mount.
+
+ 10. 'A' is a shared mount and 'B' is a slave mount. The mount 'A' is
+ mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a shared
+ mount.
+
+ 11. 'A' is a slave mount of mount 'Z' and 'B' is a slave mount. The
+ mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to
+ be a slave mount of mount 'Z'.
+
+ 12. 'A' is an unclonable mount and 'B' is a slave mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
+ unclonable mount.
+
+ 13. 'A' is a private mount and 'B' is an unclonable mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ private mount.
+
+ 14. 'A' is a shared mount and 'B' is an unclonable mount. The mount 'A'
+ is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a
+ shared mount.
+
+ 15. 'A' is a slave mount of mount 'Z' and 'B' is an unclonable mount.
+ The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues
+ to be a slave mount of mount 'Z'.
+
+ 16. 'A' is an unclonable mount and 'B' is an unclonable mount. The mount
+ 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be an
+ unclonable mount.
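+
+ As with bind, the move table can be transcribed into a lookup. This is
+ only a sketch: the names are illustrative assumptions, and the
+ 'under_shared' parameter is a hypothetical way to express the NOTE
+ under the table (moving a mount that resides under a shared mount is
+ invalid).
+
```c
#include <assert.h>

/* Propagation types as named in this document (names are illustrative). */
enum ptype { P_SHARED, P_PRIVATE, P_SLAVE, P_UNCLONABLE, P_INVALID };

/*
 * Type of mount 'A' after "mount --move A B/b", transcribed from the
 * MOVE MOUNT OPERATION table. 'under_shared' is non-zero when 'A'
 * currently resides under a shared mount.
 */
static enum ptype move_result(enum ptype src, enum ptype dest,
                              int under_shared)
{
    assert(src <= P_UNCLONABLE && dest <= P_UNCLONABLE);

    if (under_shared)
        return P_INVALID;   /* NOTE under the table */
    if (dest == P_SHARED)   /* first column of the table */
        return src == P_UNCLONABLE ? P_INVALID : P_SHARED;
    return src;             /* the type of 'A' is unchanged otherwise */
}
```
+
+ The only difference from the bind table is the unclonable row: an
+ unclonable mount can be moved anywhere except onto a shared mount.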
+
+5D) Mount semantics
+
+ Consider the following command
+
+ mount device B/b
+
+ 'B' is the destination mount and 'b' is the dentry in the destination
+ mount.
+
+ The above operation is the same as the bind operation, with the
+ exception that the source mount is always a private mount.
+
+
+5E) Unmount semantics
+
+ Consider the following command
+
+ umount A
+
+ where 'A' is a mount mounted on mount 'B' at dentry 'b'.
+
+ If mount 'B' is shared, then all the most-recently-mounted mounts at
+ dentry 'b' on mounts that receive propagation from mount 'B', and that
+ do not have sub-mounts within them, are unmounted.
+
+ Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to
+ each other.
+
+ Let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount
+ 'B1', 'B2' and 'B3' respectively.
+
+ Let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on
+ mount 'B1', 'B2' and 'B3' respectively.
+
+ If 'C1' is unmounted, all the mounts that are most-recently-mounted on
+ 'B1' and on the mounts that 'B1' propagates to are unmounted.
+
+ 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount
+ on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'.
+
+ So all of 'C1', 'C2' and 'C3' should be unmounted.
+
+ If either 'C2' or 'C3' has some child mounts, then that mount is not
+ unmounted, but all the other mounts are unmounted. However, if 'C1' is
+ to be unmounted and 'C1' has some sub-mounts, the umount operation
+ fails entirely.
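+
+ The example can be simulated with a stack of mounts per peer. This is a
+ minimal sketch under simplifying assumptions: all the mounts involved
+ are shared peers (no slaves), each peer keeps a plain array of the
+ mounts stacked at dentry 'b', and all the structure and function names
+ are hypothetical.
+
```c
#include <assert.h>

#define MAX_STACK 4

/* One mount stacked at dentry 'b'; 'submounts' counts mounts under it. */
struct stacked {
    const char *name;
    int submounts;
};

/* A peer mount ('B1', 'B2', ...) and the mounts stacked on its 'b'. */
struct peer {
    int depth;                       /* number of stacked mounts */
    struct stacked at_b[MAX_STACK];  /* at_b[depth - 1] is the most recent */
};

/*
 * Unmount the most recent mount at 'b' on peers[target] and propagate the
 * event to the other peers. Returns the number of mounts unmounted, or -1
 * if the operation fails entirely.
 */
static int umount_propagate(struct peer *peers, int npeers, int target)
{
    int i, unmounted = 0;
    struct peer *t;

    assert(target >= 0 && target < npeers);
    t = &peers[target];
    if (t->depth == 0 || t->at_b[t->depth - 1].submounts > 0)
        return -1;      /* the mount named by the caller is busy */

    for (i = 0; i < npeers; i++) {
        struct peer *p = &peers[i];

        if (p->depth == 0 || p->at_b[p->depth - 1].submounts > 0)
            continue;   /* nothing mounted, or it has sub-mounts: skip */
        p->depth--;
        unmounted++;
    }
    return unmounted;
}
```
+
+ With 'C2' holding a sub-mount, unmounting 'C1' removes 'C1' and 'C3' and
+ leaves 'C2' in place. With 'C1' holding a sub-mount, nothing is
+ unmounted.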
+
+5F) Clone Namespace
+
+ A cloned namespace has all the same mounts as the parent namespace,
+ except that it skips all the mounts under an unclonable mount.
+
+ Let's say 'A' and 'B' are the corresponding mounts in the parent and the
+ child namespace.
+
+ If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to
+ each other.
+
+ If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of
+ 'Z'.
+
+ If 'A' is a private mount, then 'B' is a private mount too.
+
+ If 'A' is an unclonable mount, 'B' does not exist.
+
+5G) Misc Semantics
+
+ A given mount can be in one of the following states:
+ 1) shared
+ 2) slave
+ 3) shared and slave
+ 4) private
+ 5) unclonable
+
+ Note the state 'shared and slave'. This state indicates that the mount
+ is a slave of some master mount, and it is shared too. This mount
+ receives propagation events from the master mount, and also forwards
+ propagation events to its shared peers and its slave mounts.
+
+ Only a shared mount can be made a slave, by executing the following
+ command
+ mount --make-slave mount
+ A shared mount that is made a slave is not shared anymore.
+
+ Only a slave mount can be made 'shared and slave', by executing
+ the following command
+ mount --make-shared mount
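+
+ The first four states and the two transitions described above can be
+ sketched with two bits per mount. This is an illustrative model only:
+ the structure, enum and function names are assumptions, and the
+ unclonable state is deliberately left out because its transitions are
+ not described here.
+
```c
#include <assert.h>

/* Illustrative model: two bits cover the first four states. */
struct mnt_state {
    int shared;      /* forwards propagation to its peers and slaves */
    int has_master;  /* receives propagation from a master mount */
};

enum state { ST_PRIVATE, ST_SHARED, ST_SLAVE, ST_SHARED_SLAVE };

static enum state get_state(const struct mnt_state *m)
{
    assert(m);
    if (m->shared && m->has_master)
        return ST_SHARED_SLAVE;
    if (m->shared)
        return ST_SHARED;
    if (m->has_master)
        return ST_SLAVE;
    return ST_PRIVATE;
}

/* mount --make-slave: only valid on a shared mount, which then stops
 * being shared. Returns 0 on success and -1 otherwise. */
static int make_slave(struct mnt_state *m)
{
    assert(m);
    if (!m->shared)
        return -1;
    m->shared = 0;
    m->has_master = 1;
    return 0;
}

/* mount --make-shared: a slave mount becomes 'shared and slave'. */
static void make_shared(struct mnt_state *m)
{
    assert(m);
    m->shared = 1;
}
```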
+
+6) Quiz
+
+ A. What happens when the following set of commands are executed?
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt /tmp
+ mount --move /tmp /mnt/1
+
+ What should the contents of /mnt, /mnt/1 and /mnt/1/1 be?
+ Should they all be identical, or should only /mnt and /mnt/1 be
+ identical?
+
+ B. What happens when the following set of commands are executed?
+
+ mount --make-rshared /
+ mkdir -p /v/1
+ mount --rbind / /v/1
+
+ What should the content of /v/1/v/1 be?
+
+
+ C. What happens when the following set of commands are executed?
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1/2/3 /mnt/1/test
+ mount --bind /mnt/1 /tmp
+ mount --make-slave /mnt
+ mount --make-shared /mnt
+ mount --bind /mnt/1/2 /tmp1
+ mount --make-slave /mnt
+
+ At this point we have the first mount at /tmp and
+ its root dentry is 1. Let's call this mount 'A'.
+ And then we have a second mount at /tmp1 with root
+ dentry 2. Let's call this mount 'B'.
+ Next we have a third mount at /mnt with root dentry
+ mnt. Let's call this mount 'C'.
+
+ 'B' is the slave of 'A' and 'C' is a slave of 'B'
+ A -> B -> C
+
+ At this point if we execute the following command
+
+ mount --bind /bin /tmp/test
+
+ the mount is attempted on 'A'.
+
+ Will the mount propagate to 'B' and 'C'?
+
+ What would the contents of
+ /mnt/1/test be?
+
+
+ D. What happens when the following set of commands are executed?
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1 /mnt/2
+ mount --bind /usr /mnt/1
+ mount --bind /mnt /mnt/2
+
+ a) mount --bind /var /mnt/2
+ What should the contents of /mnt/1 and /mnt/2 be?
+
+ b) mount --bind /var /mnt/1
+ What should the contents of /mnt/1 and /mnt/2 be?
+
+7) FAQ
+
+ Q1. Why is bind mount needed? How is it different from symbolic links?
+ Symbolic links can get stale if the destination mount gets unmounted
+ or moved. Bind mounts continue to exist even if the other mount is
+ unmounted or moved.
+
+ Q2. Why can't the shared subtree be implemented using exportfs?
+ exportfs is a heavyweight way of accomplishing part of what shared
+ subtree can do. I cannot imagine a way to implement the semantics of
+ slave mount using exportfs.
+
+ Q3. Why is unclonable mount needed?
+ If one rbind mounts a tree within the same subtree 'n' times,
+ the number of mounts created is an exponential function of 'n'.
+ Having unclonable mounts can help prune the unneeded bind mounts.
+
+8) Bugs
+
+ Current VFS implementation makes the most-recent-mount visible
+ instead of making the top-most mount visible.
+
+ This edge-case shows up when multiple mounts are mounted on
+ the same dentry of a given mount.
+
+ consider the following command sequence
+
+ (1) cd /mnt
+ (2) mount --bind /usr /mnt
+ (3) mount --bind /bin /mnt
+ (4) mount --bind /var .
+
+ After step 1, the pwd of the process points to the 'mnt' dentry
+ of the root mount. Let's call the root mount 'A'.
+
+ After step 2, a new mount is laid on top of 'A' at the mountpoint
+ 'mnt'. Let's call this mount 'B'.
+
+ After step 3, a new mount is laid on top of 'B' on the root dentry of
+ 'B'. Let's call this new overlaid mount 'C'. At this point the
+ visible content of /mnt is the content of 'C'.
+
+ However at step 4, a new mount is laid on top of 'A' at the same
+ mountpoint 'mnt' as that of 'B'. Let's call the new mount 'D'.
+
+ Note mount 'B' resides on top of 'A', and mount 'D' is mounted on top
+ of 'A' at the same mountpoint as that of 'B'.
+
+ Since 'B' is above 'A' and 'D' is below 'B' but above 'A', one would
+ naturally expect 'C', which sits on top of 'B', to continue to be visible.
+
+ But that is not the case. 'D' becomes visible as per the current
+ implementation of VFS.
+
+ These semantics, if extended to shared subtrees, can cause mind-boggling
+ confusion.
+
+ Here is a scenario with shared subtree. Sorry it is complex.
+
+ mount --bind /mnt /mnt
+ mount --make-shared /mnt
+ mkdir -p /mnt/1 /mnt/2
+ mount --bind /usr /mnt/1
+ mount --bind /mnt /mnt/2
+
+ At this stage the mounts at /mnt/2 and /mnt belong to the same pnode,
+ which means mounts under them propagate to each other.
+
+ mount --bind /var /mnt/1
+
+ The contents of /var will be visible under /mnt/1 and not under /mnt/2.
+ If instead 'mount --bind /var /mnt/2' is executed, the contents of /var
+ are visible under /mnt/1 as well as /mnt/2.
+
+ On analysis it turns out the culprit is the current rule which says
+ 'expose the most-recent-mount and not the topmost mount'.
+
+ The current implementation of shared subtree has not changed the
+ semantics for the normal case, but has implemented the
+ top-most-mount-visible semantics for mounts that happen in the context
+ of shared-subtree. This could be perceived as a bug! This issue needs
+ some collective thought.
+
+
+9) Implementation
+
+ Four new fields are added to struct vfsmount:
+ ->mnt_share
+ ->mnt_slave_list
+ ->mnt_slave
+ ->mnt_master
+
+ ->mnt_share links together all the mounts to/from which this mount
+ sends/receives mount/umount propagations.
+
+ ->mnt_slave_list links all the mounts to which this mount propagates.
+
+ ->mnt_slave links together all the slaves that its master mount
+ propagates to.
+
+ ->mnt_master points to the master mount from which this mount receives
+ propagation.
+
+
+ ->mnt_flags takes two more flags to indicate the propagation status of
+ the mount. MNT_SHARE indicates that the mount is a shared mount.
+ MNT_UNCLONABLE indicates that the mount cannot be replicated.
+
+
+ An example propagation tree looks as shown in the figure below.
+ [ NOTE: Though it looks like a forest, if we consider all the shared
+ mounts as a conceptual entity called 'pnode', it becomes a tree]
+
+
+         A <--> B <--> C <---> D
+        /|\           /|       |\
+       / F G         J K       H I
+      /
+     E<-->K
+         /|\
+        M L N
+
+ In the above figure 'A', 'B', 'C' and 'D' are all shared and propagate
+ to each other. 'A' has 3 slave mounts 'E', 'F' and 'G'; 'C' has 2 slave
+ mounts 'J' and 'K'; and 'D' has 2 slave mounts 'H' and 'I'.
+ 'E' is also shared with 'K' and they propagate to each other. And
+ 'K' has 3 slaves 'M', 'L' and 'N'.
+
+ A's ->mnt_share links with the ->mnt_share of 'B', 'C' and 'D'.
+
+ A's ->mnt_slave_list links with the ->mnt_slave of 'E', 'F' and 'G'.
+
+ E's ->mnt_share links with the ->mnt_share of 'K'.
+ 'E', 'F', 'G' have their ->mnt_master point to the struct vfsmount of 'A'.
+ 'M', 'L', 'N' have their ->mnt_master point to the struct vfsmount of 'K'.
+ K's ->mnt_slave_list links with the ->mnt_slave of 'M', 'L' and 'N'.
+
+ C's ->mnt_slave_list links with the ->mnt_slave of 'J' and 'K'.
+ 'J' and 'K' have their ->mnt_master point to the struct vfsmount of 'C'.
+ And finally D's ->mnt_slave_list links with the ->mnt_slave of 'H' and 'I'.
+ 'H' and 'I' have their ->mnt_master point to the struct vfsmount of 'D'.
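+
+ The linkage can be modeled in user space to see which mounts receive
+ propagation from a given mount. This is a simplified sketch, not the
+ kernel code: plain pointers stand in for the list heads, 'struct mnt'
+ and the helper names are hypothetical, and it assumes that a peer group
+ is reached through at most one slave link, as in the figure.
+
```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the four new struct vfsmount fields. */
struct mnt {
    char name;
    struct mnt *share_next;  /* like ->mnt_share: circular ring of peers */
    struct mnt *slave_list;  /* like ->mnt_slave_list: first slave */
    struct mnt *slave_next;  /* like ->mnt_slave: next slave of the master */
    struct mnt *master;      /* like ->mnt_master */
};

static void mnt_init(struct mnt *m, char name)
{
    m->name = name;
    m->share_next = m;       /* a ring containing only itself */
    m->slave_list = NULL;
    m->slave_next = NULL;
    m->master = NULL;
}

/* Add 'peer' (alone in its own ring) to the peer ring of 'm'. */
static void add_peer(struct mnt *m, struct mnt *peer)
{
    peer->share_next = m->share_next;
    m->share_next = peer;
}

/* Make 'slave' a slave of 'master'. */
static void add_slave(struct mnt *master, struct mnt *slave)
{
    slave->master = master;
    slave->slave_next = master->slave_list;
    master->slave_list = slave;
}

/* Count 'm', its peers, and every mount reachable below them via slaves. */
static int count_group(const struct mnt *m)
{
    const struct mnt *p = m, *s;
    int n = 0;

    do {
        n++;
        for (s = p->slave_list; s; s = s->slave_next)
            n += count_group(s);
        p = p->share_next;
    } while (p != m);
    return n;
}

/* Number of mounts that receive propagation from 'm' (excluding 'm'). */
static int propagation_targets(const struct mnt *m)
{
    assert(m);
    return count_group(m) - 1;
}
```
+
+ Building the figure with these helpers, and treating the two mounts
+ labelled 'K' as distinct mounts, a mount event on 'A' reaches 14 other
+ mounts, while an event on the slave 'J' reaches none.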
+
+
+ The propagation tree is orthogonal to the mount tree.
+ One of the most complex operations is
+ mount --move A B
+
+ where 'A' contains a mount tree.
+ 'A' has its own propagation tree and 'B' has its own propagation tree.
+
+ The overall algorithm breaks the operation into 3 phases:
+ (look at attach_recursive_mnt() and propagate_prepare_mount())
+
+ 1. prepare phase.
+ 2. commit phase.
+ 3. abort phase.
+
+ Prepare phase:
+
+ for each mount in the source tree:
+   1. unlink the mount from its ->mnt_list
+   2. a) attach that mount to the destination
+      b) create the necessary number of clone mounts that propagate to
+         all the mounts that the destination mount propagates to.
+      c) do not attach the mount to the destination; however, note the
+         destination in its ->mnt_parent and ->mnt_mountpoint,
+         holding a reference to them.
+      d) link all the new mounts to form a propagation tree that is
+         identical to the propagation tree of the destination mount.
+      e) link all these new mounts together through ->mnt_list
+
+ If this phase is successful, there should be 'n' new propagation
+ trees, where 'n' is the number of mounts in the source tree.
+ Go to the commit phase.
+
+ If any memory allocation fails, or anything else fails, go to the
+ abort phase.
+
+ Commit phase
+ for each mount in the source tree (say A)
+   walk that mount's mnt_list and for each mount (say B)
+     a) delink its ->mnt_list
+     b) attach the mount to its parent. (->mnt_child and ->mnt_hash
+        get linked)
+     c) add the mount to the parent's namespace
+     d) mark the mount as MNT_SHARE if the parent mount is MNT_SHARE
+     e) add the ->mnt_expire to the list of that of 'A'
+
+ Abort phase
+ for each mount in the source tree (say A)
+   walk that mount's mnt_list and for each mount
+     a) delink it from its propagation tree
+     b) delete the mount if it was newly cloned.
+     c) release any references it held to its parent mount and
+        the mountpoint.
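+
+ The reason for splitting the work into phases is that every step that
+ can fail happens before anything becomes visible. The following is a
+ generic sketch of that prepare/commit/abort pattern, not the code of
+ attach_recursive_mnt() or propagate_prepare_mount(): 'struct clone',
+ the function names and the 'fail_at' parameter (which simulates an
+ allocation failure) are all hypothetical.
+
```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cloned mount that is not yet attached. */
struct clone {
    int attached;
};

static int live_clones;     /* clones currently allocated */

/* Abort: throw away the first 'n' prepared clones. */
static void discard_clones(struct clone *c[], int n)
{
    while (n-- > 0) {
        free(c[n]);
        c[n] = NULL;
        live_clones--;
    }
}

/*
 * Prepare: allocate one clone per propagation target without attaching
 * anything. 'fail_at' >= 0 simulates an allocation failure at that index.
 * Returns 'n' on success, or -1 after undoing the partial work.
 */
static int prepare(struct clone *c[], int n, int fail_at)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++) {
        c[i] = (i == fail_at) ? NULL : calloc(1, sizeof(*c[i]));
        if (!c[i]) {
            discard_clones(c, i);   /* abort phase */
            return -1;
        }
        live_clones++;
    }
    return n;
}

/* Commit: cannot fail, it only makes the prepared clones visible. */
static void commit(struct clone *c[], int n)
{
    int i;

    for (i = 0; i < n; i++)
        c[i]->attached = 1;
}
```
+
+ Because commit() has no failure path, a caller either sees all the
+ clones attached or, after a failed prepare(), none of them left behind.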
+
+
+
+ mount --rbind A B
+
+ is similar to the --move operation. Here the source tree is not exactly
+ that of 'A', but a clone of the tree at A. Hence the additional
+ operation is to link the propagation tree created in the prepare
+ phase to that of 'A's propagation tree.
+
+ All other operations are trivial and should be clear by looking at the
+ code.
+
+ NOTE: all the propagation related functionality resides in the file
+ pnode.c
+
+------------------------------------------------------------------------
+
+version 0.1 (created the initial document, Ram Pai linuxram@xxxxxxxxxx)