Re: [PATCH 03/34] teach move_mount(2) to work with OPEN_TREE_CLONE [ver #12]
From: Alan Jenkins
Date: Thu Oct 11 2018 - 07:48:50 EST
On 11/10/2018 10:17, David Howells wrote:
Alan Jenkins <alan.christopher.jenkins@xxxxxxxxx> wrote:
# unshare --mount=private_mnt/child_ns --propagation=shared ls -l /proc/self/ns/mnt
I think the problem is that the mount of the nsfs object done by unshare here
pins the new mount namespace - but doesn't add the namespace's contents into
the mount tree, so the mount struct cycle-detection code is bypassed.
I think it's fine for all other namespaces, just not the mount namespace.
It looks like this bug might theoretically exist upstream also, though I don't
think there's any way to actually effect it given that mount() doesn't take a
dirfd argument.
The reason that you can do this with open_tree()/move_mount() is that it
allows you to create a mount tree (OPEN_TREE_CLONE) that has no namespace
assignment, pass it through the namespace switch and then attach it inside the
child namespace. The cross-namespace checks in do_move_mount() are bypassed
because the root of the newly-cloned mount tree doesn't have one.
Unfortunately, just searching the newly-cloned mount tree for a conflicting
nsfs mount doesn't help because the potential loop could be hidden several
levels deep.
I think the simplest solution is to either reject a request for
open_tree(OPEN_TREE_CLONE) if there are any nsfs objects in the source tree,
or to just not copy said objects.
David
Very clearly written, thank you. Hum, your solution would mean
open_tree(OPEN_TREE_CLONE) + move_mount() is not equivalent to the
current `mount --rbind` :-(. That does not fit the current patch
description.
It sounds like you're under-estimating how we can use mnt_ns->seq (as is
currently used in mnt_ns_loop()). Or maybe I am over-estimating it :).
In principle, it should suffice for attach_recursive_mount() to check
the NS sequence numbers of the NS files which are mounted. You can't
hide the loop at a deeper level inside the NS, because of the existing
mnt_ns_loop() check.
I think mnt_ns_loop() works 100% correctly upstream, and there is no
memory leak bug there. You can pass a mount NS fd between processes in
arbitrary namespaces, and you can mount it with "mount --no-canonicalize
--bind /proc/self/fd/3 /other_ns". But mnt_ns_loop() will only allow
the mount when the other NS is newer than your own mount namespace.
Upstream also covers mount propagation (and CLONE_NEWNS), by simply not
propagating mounts of mount NS files. ( See commit 4ce5d2b1a8fd "vfs:
Don't copy mount bind mounts of /proc/<pid>/ns/mnt between namespaces" /
https://unix.stackexchange.com/questions/473717/what-code-prevents-mount-namespace-loops-in-a-more-complex-case-involving-mount-propagation
)
I think it is more a question of taste :-). Would it be acceptable to
prune the tree (or fail?) in move_mount() (and also `mount --move`, if
you [ab]use it like I did) ?
I suspect we should prefer your solution. It is clearly simpler, and I
don't know that anyone really uses `mount --rbind` to clone trees of
mount NS files.
Either way, I suggest we take care to say whether `mount --rbind` and
`mount --bind` can be implemented using open_tree() + move_mount(), or
whether we think it might be undesirable. (E.g. because someone might
read the current commit message, and desire to implement `mount
--bind,ro` atomically, if/when we also have mount_setattr() ).
Regards
Alan
---
Test script:
mount -t tmpfs none /a
mount --make-shared /a
cd /a
mkdir private_mnt
mount -t tmpfs xxx private_mnt
mount --make-private private_mnt
touch private_mnt/child_ns
unshare --mount=private_mnt/child_ns --propagation=shared \
ls -l /proc/self/ns/mnt
findmnt
~/open_tree 3</a/private_mnt 3 \
nsenter --mount=/a/private_mnt/child_ns \
sh -c '~/move_mount 4</mnt'
grep Shmem: /proc/meminfo
dd if=/dev/zero of=/a/private_mnt/bigfile bs=1M count=10
umount -l /a/private_mnt/
grep Shmem: /proc/meminfo