Re: [PATCH 01/33] vfs: syscall: Add open_tree(2) to reference or clone a mount [ver #11]

From: Alan Jenkins
Date: Thu Aug 02 2018 - 13:31:13 EST


Hi

I found this interesting, though I don't entirely follow the kernel mount/unmount code. I had one puzzle about the code, and two questions which I was largely able to answer.

On 01/08/18 16:24, David Howells wrote:
+void dissolve_on_fput(struct vfsmount *mnt)
+{
+ namespace_lock();
+ lock_mount_hash();
+ mntget(mnt);
+ umount_tree(real_mount(mnt), UMOUNT_SYNC);
+ unlock_mount_hash();
+ namespace_unlock();
+}

Can I ask why UMOUNT_SYNC is used here? I feel like I must have missed something, but doesn't it skip the IS_MNT_LOCKED() check in disconnect_mount() ?

UMOUNT_SYNC seems used for non-lazy unmounts, and in internal cleanups where userspace wouldn't be able to see. But I think userspace can keep watching in this case, e.g. by `fd2 = openat(fd, ".", O_PATH)` (or `fd2 = open_tree(fd, ".", 0)` ?). I would think this function should avoid using UMOUNT_SYNC, like lazy unmount avoids it.

From: Al Viro <viro@xxxxxxxxxxxxxxxxxx>

open_tree(dfd, pathname, flags)

Returns an O_PATH-opened file descriptor or an error.
dfd and pathname specify the location to open, in usual
fashion (see e.g. fstatat(2)). flags should be an OR of
some of the following:
* AT_PATH_EMPTY, AT_NO_AUTOMOUNT, AT_SYMLINK_NOFOLLOW -
same meanings as usual
* OPEN_TREE_CLOEXEC - make the resulting descriptor
close-on-exec
* OPEN_TREE_CLONE or OPEN_TREE_CLONE | AT_RECURSIVE -
instead of opening the location in question, create a detached
mount tree matching the subtree rooted at location specified by
dfd/pathname. With AT_RECURSIVE the entire subtree is cloned,
without it - only the part within in the mount containing the
location in question. In other words, the same as mount --rbind
or mount --bind would've taken.

One of the limitations documented for `mount --bind`, is that `mount -o bind,ro` is not atomic. There's a workaround if you need it, it's just a bit clunky. I wondered if it was possible to improve `mount` by changing the mount flags between OPEN_TREE_CLONE and move_mount().

ÂÂÂ fd = open_tree(..., OPEN_TREE_CLONE);
ÂÂÂ fchdir(fd);
ÂÂÂ mount(NULL, ".", NULL, MS_REMOUNT | MS_BIND | newbindflags, NULL);
ÂÂÂ move_mount(fd, ...);

Another closely-related limitation of `mount`, is that it can't atomically set the propagation type at mount time.

My conclusion was the above doesn't quite work yet. do_remount() still uses check_mnt(), so it doesn't accept detached mounts. I wonder if it can be changed in future.

The detached tree will be
dissolved on the final close of obtained file.

My last question turned out very dull, feel free to ignore.

It seems the only way to use MNT_FORCE[1], is to first attach the filesystem somewhere (and close the file descriptor). I wondered if there was a way to make things more regular. close_and_umount() feels too obscure to live though...

[1] "Ask the filesystem to abort pending requests before attempting theunmount. This may allow the unmount to complete without waitingfor an inaccessible server. If, after aborting requests, someprocesses still have active references to the filesystem, theunmount will still fail."

...and I suppose it's much less useful than I thought. The point of MNT_FORCE is to kick out processes that were blocked _trying to access a file by name_, e.g. open() or stat(). But if we're considering a detached mount, then it's impossible to access it by name alone. You need an fd (or cwd or root), which would stop the filesystem being unmounted anyway. close_and_umount(fd, MNT_FORCE) is pointless unless your process has other threads accessing the filesystem through the same fd, but that's a really bad idea anyway.

It could prevent someone else getting stuck indefinitely on /proc/$PID/fd, but that's still very obscure.

Regards
Alan