Hi! Can we restart the discussion on this topic?
In CRIU we need to be able to dump/restore all mount trees of system container (CT). CT can have anything inside - users which create their custom mounts configuration, systemd with custom mount namespaces for it's services, nested application containers inside the CT with their own mount namespaces, and all mounts in CT mount trees can be grouped by sharing groupes (e.g. same shared_id + master_id pair), and those groups can depend one from another forming a tree structure of sharing groups.
1) Imagine that we have this sharing group tree (in format (shared_id, master_id), 0 means no sharing, we don't care about actual mounts for now only master-slave dependencies between sharing groups):
(1,0)
|- (2,1)
|- (3,1)
|- (4,3)
|- (0,4)
The main problem of restoring mounts is the fact that sharing groups currently can be only inherited, e.g. if you have one mount (first) with shared_id = x, master_id = y, the only way to get another mount with (x,y) is to create a bindmount from the first mount. Also to create mount (y,z) from mount (x,y) one should also first inherit (x,y) via bindmount and than change to (y,z).
This means that mentioned above tree puts restriction on the mounts creation order, one need to have at least one mount for each of sharing groups (1,0), (3,1) and (4,3) before creating the first mount of the sharing group (0,4).
But what if we want to mount (restore) actual mounts in this mount tree "reverse" order:
mntid parent mountpoint (shared_id, master_id)
101 0 /tmp (0,4)
102 101 /tmp (4,3)
103 102 /tmp (3,1)
104 103 /tmp (1,0)
Mount 104's sharing group should be created before mount 101, 102 and 103 sharing groups, but mount 104 should be created after those mounts. One can actually prepare this setup (on mainstream kernel) by pre-creating sharing groups elsewhere and then binding to /tmp in proper order with careful unmounting of propagations (see test.sh attached):
[root@snorch propagation-tests]# bash ../test.sh
------------
960 1120 0:56 / /tmp/propagation-tests/tmp rw,relatime master:452 - tmpfs propagation-tests-src rw,inode64
958 960 0:56 / /tmp/propagation-tests/tmp/sub rw,relatime shared:452 master:451 - tmpfs propagation-tests-src rw,inode64
961 958 0:56 / /tmp/propagation-tests/tmp/sub/sub rw,relatime shared:451 master:433 - tmpfs propagation-tests-src rw,inode64
963 961 0:56 / /tmp/propagation-tests/tmp/sub/sub/sub rw,relatime shared:433 - tmpfs propagation-tests-src rw,inode64
------------
But this "pre-creating" from test.sh is not universal at all and only works for this simple case. CRIU does not know anything about the history of mount creation for system container, it also does not know anything about any temporary mounts which were used and then removed. So understanding the proper order is almost impossible like Andrew says.
I've also prepared a presentation on Linux Plumbers last year about how much problems propagation brings to mounts restore in CRIU, you can take a look here https://www.linuxplumbersconf.org/event/7/contributions/640/
2) Propagation creates tons of mounts
3) Mount reparenting
4) "Mount trap"
5) "Non-uniform" propagation
6) “Cross-namespace” sharing groups
Allowing to create mounts private first and create sharing groups later and copy sharing groups later instead of inheriting them resolves all the problems with propagation at once.
One can take a look on the implementation of sharing group restore in CRIU if we have this (mnt: allow to add a mount into an existing group) patch applied: https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L898
Obviously this does not solve all the problems with mounts I know about but it's a big step forward in properly supporting them in CRIU. We already have this tested in Virtuozzo for almost a year and it works nice.
Notes:
- There is another idea, but I should say early that I don't like it, because with it restoring mounts with criu would be still super complex. We can add extra flag to mount/move_mount syscall to disable propagation temporary so that CRIU can restore the mount tree without problems 2-5, also we can now create cross-namespace bindmounts with (copy_tree+move_mount) to solve 6. But this solution does not help much with problem 1 - ordering and the need of temporary mounts. As you can see in test.sh you would still need to think hard to solve different similar configurations of reverse order between mounts and sharing groups.
- We can actually prohibit cross-namespace MS_SET_GROUP if you like. (If both namespaces are non abstract.) We can use open_tree to create a copy of the mount with the same sharing group and only then copy sharing from the copy while being in proper mountns.
- We still need it:
> this code might be made unnecessary by allowing bind mounts between
> mount namespaces.
No, because of problem 1. Guessing right order would be still to complex.
- This approach does not allow creation of any "bad" trees.
> Can they create loops in mount propagation trees that we can not create today?
There would be no loops in "sharing groups tree" for sure, as this new MS_SET_GROUP only adds one _private_ mount to one group (without moving between groups), the tree itself is unchanged after mount(MS_SET_GROUP).
- Probably mount syscall is not the right place for MS_SET_GROUP. I see new syscall mount_setattr, first I thought reworking MS_SET_GROUP to be a part of it, but interface of mount_setattr for copying is not convenient. Probably we can add MS_SET_GROUP flag to move_mount which has exactly what we want path to source and destination relative to fd:
static inline int move_mount(int from_dfd, const char *from_pathname,
int to_dfd, const char *to_pathname,
unsigned int flags)
As in mount-v2 now I had to use proc hacks to access mounts at dfd:
https://github.com/Snorch/criu/blob/bebbded98128ec787950fa8365a6c74ce6a3b2cb/criu/mount-v2.c#L923
- I hope that we still have a chance for MS_SET_GROUP, this way we can port mount-v2 to mainstream CRIU.