Re: pivot_root(".", ".") and the fchdir() dance

From: Eric W. Biederman
Date: Tue Sep 10 2019 - 19:07:19 EST


"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

> Hello Christian,
>
>>> All: I plan to add the following text to the manual page:
>>>
>>> new_root and put_old may be the same directory. In particular,
>>> the following sequence allows a pivot-root operation without needing
>>> to create and remove a temporary directory:
>>>
>>> chdir(new_root);
>>> pivot_root(".", ".");
>>> umount2(".", MNT_DETACH);
>>
>> Hm, should we mention that MS_PRIVATE or MS_SLAVE is usually needed
>> before the umount2()? Especially for the container case... I think we
>> discussed this briefly yesterday in person.
> Thanks for noticing. That detail (more precisely: not MS_SHARED) is
> already covered in the numerous other changes that I have pending
> for this page:
>
> The following restrictions apply:
> ...
> - The propagation type of new_root and its parent mount must not
> be MS_SHARED; similarly, if put_old is an existing mount point,
> its propagation type must not be MS_SHARED.

Ugh. That is close but not quite correct.

A better explanation:

The pivot_root system call will never propagate any changes it makes.
The pivot_root system call ensures this is safe by verifying that
none of put_old, the parent of new_root, and the parent of the root
directory have a propagation type of MS_SHARED.
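
One way to see from userspace which mounts are shared is the optional
"shared:N" tag in /proc/self/mountinfo. The helper below is only a
sketch: the name mountinfo_line_is_shared() is made up for illustration,
and it assumes a well-formed mountinfo line (six fixed fields, then
optional fields, then a "-" separator).

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper: return true if one line of /proc/self/mountinfo
 * carries a "shared:N" tag among its optional fields.
 * Assumed layout: fields 1-6 are fixed, optional fields follow, and a
 * lone "-" ends them. Spaces inside paths are escaped as \040, so
 * splitting on ' ' is safe.
 */
bool mountinfo_line_is_shared(const char *line)
{
    int field = 1;
    const char *p = line;

    while (*p != '\0') {
        size_t len = strcspn(p, " \n");   /* length of current field */

        if (field > 6) {
            if (len == 1 && p[0] == '-')
                return false;             /* separator: no more optional fields */
            if (len >= 7 && strncmp(p, "shared:", 7) == 0)
                return true;              /* mount is in a shared peer group */
        }

        p += len;
        while (*p == ' ' || *p == '\n')   /* skip to the next field */
            p++;
        field++;
    }
    return false;
}
```

Feeding it each line read from /proc/self/mountinfo shows which mounts
would trip the check described above.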

The concern from our conversation at the container mini-summit was a
pathology: if all of the mounts in your initial mount namespace are
marked MS_SHARED, as systemd does (and as is almost necessary if you
are going to use mount propagation), and new_root itself is MS_SHARED,
then unmounting the old_root could propagate.

So I believe the desired sequence is:

>>> chdir(new_root);
+++ mount(NULL, ".", NULL, MS_SLAVE | MS_REC, NULL);
>>> pivot_root(".", ".");
>>> umount2(".", MNT_DETACH);

The change to new_root could be either MS_SLAVE or MS_PRIVATE. So
long as it is not MS_SHARED, the unmount won't propagate back to the
parent mount namespace.

Eric