Re: pivot_root(".", ".") and the fchdir() dance

From: Eric W. Biederman
Date: Mon Sep 30 2019 - 07:43:10 EST


"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

> Hello Eric,
>
> A ping on my question below. Could you take a look please?
>
> Thanks,
>
> Michael
>
>>>>> The concern from our conversation at the container mini-summit was that
>>>>> there is a pathology if in your initial mount namespace all of the
>>>>> mounts are marked MS_SHARED like systemd does (and is almost necessary
>>>>> if you are going to use mount propagation), that if new_root itself
>>>>> is MS_SHARED then unmounting the old_root could propagate.
>>>>>
>>>>> So I believe the desired sequence is:
>>>>>
>>>>>>>> chdir(new_root);
>>>>> +++ mount("", ".", MS_SLAVE | MS_REC, NULL);
>>>>>>>> pivot_root(".", ".");
>>>>>>>> umount2(".", MNT_DETACH);
>>>>>
>>>>> The change to new new_root could be either MS_SLAVE or MS_PRIVATE. So
>>>>> long as it is not MS_SHARED the mount won't propagate back to the
>>>>> parent mount namespace.
>>>>
>>>> Thanks. I made that change.
>>>
>>> For what it is worth. The sequence above without the change in mount
>>> attributes will fail if it is necessary to change the mount attributes
>>> as "." is both put_old as well as new_root.
>>>
>>> When I initially suggested the change I saw "." was new_root and forgot
>>> "." was also put_old. So I thought there was a silent danger without
>>> that sequence.
>>
>> So, now I am a little confused by the comments you added here. Do you
>> now mean that the
>>
>> mount("", ".", MS_SLAVE | MS_REC, NULL);
>>
>> call is not actually necessary?

Apologies for being slow getting back to you.

To my knowledge there are two cases where pivot_root is used.
- In the initial mount namespace from a ramdisk when mounting root.
  This is the original use case and somewhat historical as rootfs
  (aka an initial ramfs) may not be unmounted.

- When setting up a new mount namespace to jettison all of the mounts
  you don't need.

The sequence:

	chdir(new_root);
	pivot_root(".", ".");
	umount2(".", MNT_DETACH);

is perfect for both use cases (as nothing needs to be known about the
directory layout of the new root filesystem).
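
For reference, a minimal C sketch of that three-call sequence. The
helper name enter_new_root() is hypothetical, and the sketch assumes
the caller already has the privileges pivot_root needs. It goes through
syscall(2) because glibc has historically not provided a pivot_root
wrapper.

```c
#include <errno.h>        /* errno is set by whichever call fails */
#include <sys/mount.h>    /* umount2(), MNT_DETACH */
#include <sys/syscall.h>  /* SYS_pivot_root */
#include <unistd.h>       /* chdir(), syscall() */

/*
 * Hypothetical helper: make new_root the root of the caller's mount
 * namespace and detach the old root.
 * Returns 0 on success, -1 on failure with errno set by the failing call.
 */
int enter_new_root(const char *new_root)
{
    /* After this, "." refers to new_root. */
    if (chdir(new_root) == -1)
        return -1;

    /* "." is both new_root and put_old: the old root ends up stacked
     * at the same location as the new root. */
    if (syscall(SYS_pivot_root, ".", ".") == -1)
        return -1;

    /* "." still refers to the old root here, so this lazily detaches it. */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return 0;
}
```

A caller would typically invoke enter_new_root("/path/to/new_root")
right after setting up the new mount namespace and treat a -1 return
as fatal.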

In the case when you are setting up a new mount namespace, propagating
changes in the mount layout to another mount namespace is fatal. But
that is not a concern for the pivot_root sequence above, because
pivot_root will fail deterministically if
'mount("", ".", MS_SLAVE | MS_REC, NULL)' is needed but not specified.

So I would document the above sequence of three system calls in the
man-page.

I would document that pivot_root will fail if propagation would occur.

I would document in pivot_root or under unshare(CLONE_NEWNS) that if
mount propagation is enabled (the default with systemd) you
need to call 'mount("", "/", MS_SLAVE | MS_REC, NULL);' or
'mount("", "/", MS_PRIVATE | MS_REC, NULL);' after creating a mount
namespace. Otherwise mounts will propagate backwards, which is usually
not what people want.
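
A minimal C sketch of that step, under some assumptions. The helper
name isolate_mount_namespace() is hypothetical. The mount() calls
quoted in this mail are shorthand; the real mount(2) takes five
arguments (source, target, filesystemtype, mountflags, data), and the
sketch passes NULL for the ones that are ignored when only the
propagation type is changed.

```c
#define _GNU_SOURCE       /* for unshare() and CLONE_NEWNS */
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: create a new mount namespace, then recursively
 * change the propagation type of every mount in it so that mount
 * events do not propagate back to the parent namespace.
 * propagation must be MS_SLAVE or MS_PRIVATE.
 * Returns 0 on success, -1 on failure with errno set.
 */
int isolate_mount_namespace(unsigned long propagation)
{
    /* Reject anything else (e.g. MS_SHARED) before touching the system. */
    if (propagation != MS_SLAVE && propagation != MS_PRIVATE) {
        errno = EINVAL;
        return -1;
    }

    if (unshare(CLONE_NEWNS) == -1)
        return -1;

    /* MS_REC applies the change to "/" and every mount below it. */
    return mount(NULL, "/", NULL, propagation | MS_REC, NULL);
}
```

Calling isolate_mount_namespace(MS_SLAVE) keeps receiving mount events
from the parent namespace, while MS_PRIVATE cuts propagation in both
directions; either one keeps later mounts and unmounts from leaking
back.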

Creating a mount namespace in a user namespace automatically does
'mount("", "/", MS_SLAVE | MS_REC, NULL);' if the starting mount
namespace was not created in that user namespace. AKA creating
a mount namespace in a user namespace does the unshare for you.

Eric