Re: For review: rewritten pivot_root(2) manual page

From: Eric W. Biederman
Date: Wed Oct 09 2019 - 12:01:58 EST


"Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

> Hello Eric,
>
> Thank you. I was hoping you might jump in on this thread.
>
> Please see below.
>
> On 10/9/19 10:46 AM, Eric W. Biederman wrote:
>> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:
>>
>>> Hello Philipp,
>>>
>>> My apologies that it has taken a while to reply. (I had been hoping
>>> and waiting that a few more people might weigh in on this thread.)
>>>
>>> On 9/23/19 3:42 PM, Philipp Wendler wrote:
>>>> Hello Michael,
>>>>
>>>> Am 23.09.19 um 14:04 schrieb Michael Kerrisk (man-pages):
>>>>
>>>>> I'm considering to rewrite these pieces to exactly
>>>>> describe what the system call does (which I already
>>>>> do in the third paragraph) and remove the "may or may not"
>>>>> pieces in the second paragraph. I'd welcome comments
>>>>> on making that change.
>
> What did you think about my proposal above? To put it in context,
> this was my initial comment in the mail:
>
> [[
> One area of the page that I'm still not really happy with
> is the "vague" wording in the second paragraph and the note
> in the third paragraph about the system call possibly
> changing. These pieces survive (in somewhat modified form)
> from the original page, which was written before the
> system call was released, and it seems there was some
> question about whether the system call might still change
> its behavior with respect to the root directory and current
> working directory of other processes. However, after 19
> years, nothing has changed, and surely it will not in the
> future, since that would constitute an ABI breakage.
> I'm considering to rewrite these pieces to exactly
> describe what the system call does (which I already
> do in the third paragraph) and remove the "may or may not"
> pieces in the second paragraph. I'd welcome comments
> on making that change.
> ]]
>
> And the second and third paragraphs of the manual page currently
> read:
>
> [[
> pivot_root() may or may not change the current root and the current
> working directory of any processes or threads that use the
> old root directory and which are in the same mount namespace as
> the caller of pivot_root(). The caller of pivot_root() should
> ensure that processes with root or current working directory at
> the old root operate correctly in either case. An easy way to
> ensure this is to change their root and current working directory
> to new_root before invoking pivot_root(). Note also that
> pivot_root() may or may not affect the calling process's current
> working directory. It is therefore recommended to call chdir("/")
> immediately after pivot_root().
>
> The paragraph above is intentionally vague because at the time
> when pivot_root() was first implemented, it was unclear whether
> its effect on other processes' root and current working
> directories--and the caller's current working directory--might change in
> the future. However, the behavior has remained consistent since
> this system call was first implemented: pivot_root() changes the
> root directory and the current working directory of each process
> or thread in the same mount namespace to new_root if they point to
> the old root directory. (See also NOTES.) On the other hand,
> pivot_root() does not change the caller's current working directory
> (unless it is on the old root directory), and thus it should
> be followed by a chdir("/") call.
> ]]

Apologies. I saw that concern, but I didn't realize it was a question.

I think it is very reasonable to remove the warning that the behavior
might change. We have pivot_root(8) in common use, and using it requires
the semantic of changing processes other than the current process.
Which means any attempt to noticeably change the behavior of
pivot_root(2) will break userspace.

Now, the semantics documented above are not quite what
pivot_root(2) does. It walks all processes on the system and if the
working directory or the root directory refer to the root mount that is
being replaced, then pivot_root(2) will update them.

In practice the above is limited to a mount namespace. But something as
simple as "cd /proc/<somepid>/root" can allow a process to have a
working directory in a different mount namespace.
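
To make the walk described above concrete, here is a toy userspace model
of it. This is only a sketch and not the kernel code: the types toy_loc
and toy_proc and the function toy_pivot_walk() are hypothetical names
invented for illustration, and a "location" is simplified to a
(mount, directory) pair of integers. The point it shows is that the walk
visits every process and compares locations; nothing in it filters by
mount namespace.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: a location is a (mount, directory) pair. */
struct toy_loc {
    int mnt;   /* hypothetical mount identifier */
    int dir;   /* hypothetical directory identifier within that mount */
};

/* Toy model of the per-process state that the walk inspects. */
struct toy_proc {
    struct toy_loc root;   /* root directory of the process */
    struct toy_loc cwd;    /* current working directory of the process */
};

static bool toy_loc_equal(struct toy_loc a, struct toy_loc b)
{
    return a.mnt == b.mnt && a.dir == b.dir;
}

/*
 * Visit every process in the array. Any root or cwd that is exactly the
 * old root location is switched to the new root location. Returns the
 * number of references that were updated.
 */
static size_t toy_pivot_walk(struct toy_proc *procs, size_t n,
                             struct toy_loc old_root,
                             struct toy_loc new_root)
{
    size_t updated = 0;

    for (size_t i = 0; i < n; i++) {
        if (toy_loc_equal(procs[i].root, old_root)) {
            procs[i].root = new_root;
            updated++;
        }
        if (toy_loc_equal(procs[i].cwd, old_root)) {
            procs[i].cwd = new_root;
            updated++;
        }
    }
    return updated;
}
```

In this model a process whose working directory is somewhere below the
old root (same mnt, different dir) is left alone, while any process
whose root or cwd is exactly the old root is moved, whichever namespace
it happens to live in.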

Because ``unprivileged'' users can now use pivot_root(2) we may want to
rethink the implementation at some point to be cheaper than a global
process walk. So far that process walk has not been a problem in
practice.

If we had to write pivot_root(2) from scratch limiting it to just
changing the root directory of the process that calls pivot_root(2)
would have been the superior semantic. That would have required running
pivot_root(8) like: "exec pivot_root . . -- /bin/bash ..." but it would
not have required walking every thread in the system.

>>>> I think that it would make the man page significantly easier to
>>>> understand if the vague wording and the meta discussion about it are
>>>> removed.
>>>
>>> It is my inclination to make this change, but I'd love to get more
>>> feedback on this point.
>>>
>>>>> DESCRIPTION
>>>> [...]
>>>>> pivot_root() changes the
>>>>> root directory and the current working directory of each process
>>>>> or thread in the same mount namespace to new_root if they point to
>>>>> the old root directory. (See also NOTES.) On the other hand,
>>>>> pivot_root() does not change the caller's current working directory
>>>>> (unless it is on the old root directory), and thus it should
>>>>> be followed by a chdir("/") call.
>>>>
>>>> There is a contradiction here with the NOTES (cf. below).
>>>
>>> See below.
>>>
>>>>> The following restrictions apply:
>>>>>
>>>>> - new_root and put_old must be directories.
>>>>>
>>>>> - new_root and put_old must not be on the same filesystem as the
>>>>> current root. In particular, new_root can't be "/" (but can be
>>>>> a bind mounted directory on the current root filesystem).
>>>>
>>>> Wouldn't "must not be on the same mountpoint" or something similar be
>>>> more clear, at least for new_root? The note in parentheses indicates
>>>> that new_root can actually be on the same filesystem as the current
>>>> root. However, ...
>>>
>>> For 'put_old', it really is "filesystem".
>>
>> If we are going to be pedantic "filesystem" is really the wrong concept
>> here. The section about bind mount clarifies it, but I wonder if there
>> is a better term.
>
> Thanks. My aim was to try to distinguish "mount point" from
> "a mount somewhere inside the file system associated with a
> certain mount point"--in other words, I wanted to make it clear
> that 'put_old' (and 'new_root') could not be subdirectories
> under the current root mount point (which is correct, right?).
>
> Using "mount" does seem better. (My only concern is that some
> people may take it to mean "the mount point", but perhaps that's
> just my own confusion.)

I am open to better terms. But mount or vfsmount is what we are using
internal to the kernel and is really a distinct concept from filesystem.
And it is starting to leak out in system calls like move_mount.

>> I think I would say: "new_root and put_old must not be on the same mount
>> as the current root."
>
> I've made that change.
>
>> I think using "mount" instead of "filesystem" keeps the concepts less
>> confusing.
>>
>> As I am reading through this email and seeing text that is trying to be
>> precise and clear then hitting the term "filesystem" is a bit jarring.
>> pivot_root doesn't care a thing for file systems. pivot_root only cares
>> about mounts.
>>
>> And by a "mount" I mean the thing that you get when you create a bind
>> mount or you call mount normally.
>
> Thanks for the above comments.
>
> Hmm, do I need to make similar changes in the initial paragraph of
> the manual page as well? It currently reads:
>
> pivot_root() changes the root filesystem in the mount namespace of
> the calling process. More precisely, it moves the root filesystem
> to the directory put_old and makes new_root the new root filesystem.
> The calling process must have the CAP_SYS_ADMIN capability
> in the user namespace that owns the caller's mount namespace.
>
> Furthermore the one line NAME of the man page reads:
>
> pivot_root - change the root filesystem
>
> Is a change needed there also?

Yes please. Both locations.

>> Michael do you have man pages for the new mount api yet?
>
> David Howells wrote pages in mid-2018, well before the syscalls got
> merged in the kernel (in mid-2019). I did not merge them because
> the code was not yet in the kernel, and lacking time, I never chased
> David when the syscalls did get merged to see if the pages were still
> up to date. I pinged David just now.

Good. I was thinking of them because the concept of "mount" matters more
there.


>>>
>>>>> - put_old must be at or underneath new_root; that is, adding a
>>>>> nonnegative number of /.. to the string pointed to by put_old
>>>>> must yield the same directory as new_root.
>>>>>
>>>>> - new_root must be a mount point. (If it is not otherwise a
>>>>> mount point, it suffices to bind mount new_root on top of
>>>>> itself.)
>>>>
>>>> ... this item actually makes the above item almost redundant regarding
>>>> new_root (except for the "/") case. So one could replace this item with
>>>> something like this:
>>>>
>>>> - new_root must be a mount point different from "/". (If it is not
>>>> otherwise a mount point, it suffices to bind mount new_root on top
>>>> of itself.)
>>>>
>>>> The above item would then only mention put_old (and maybe use clarified
>>>> wording on whether actually a different file system is necessary for
>>>> put_old or whether a different mount point is enough).
>>>
>>> Thanks. That's a good suggestion. I simplified the earlier bullet
>>> point as you suggested, and changed the text here to say:
>>>
>>> - new_root must be a mount point, but can't be "/". If it is not
>>> otherwise a mount point, it suffices to bind mount new_root on
>>> top of itself. (new_root can be a bind mounted directory on
>>> the current root filesystem.)
>>
>> How about:
>> - new_root must be the path to a mount, but can't be "/". Any
>
> Surely here it must be "mount point" not "mount"? (See my discussion
> above.)

Sigh. I have had my head in the code too long, where new_root is
used to refer to the mount that is mounted on that mount point as well.


>
>> path that is not already a mount can be converted into one by
>> bind mounting the path onto itself.
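
For reference, the bind-mount-onto-itself step can be sketched in C
roughly as below. The helper name bind_mount_onto_self() is made up for
illustration; the mount(2) call itself is the ordinary MS_BIND form. It
assumes the caller has CAP_SYS_ADMIN in the relevant user namespace.

```c
#include <stddef.h>
#include <sys/mount.h>

/*
 * Turn 'path' into a mount point by bind mounting it on top of itself.
 * Returns 0 on success, or -1 with errno set (for example EPERM without
 * CAP_SYS_ADMIN, or ENOENT if the path does not exist).
 */
static int bind_mount_onto_self(const char *path)
{
    /* For MS_BIND the filesystem type and data arguments are ignored. */
    return mount(path, path, NULL, MS_BIND, NULL);
}
```

After this succeeds, 'path' refers to a mount of its own, so it can be
passed as new_root.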
>>>>> NOTES
>>>> [...]
>>>>> pivot_root() allows the caller to switch to a new root filesystem
>>>>> while at the same time placing the old root mount at a location
>>>>> under new_root from where it can subsequently be unmounted. (The
>>>>> fact that it moves all processes that have a root directory or
>>>>> current working directory on the old root filesystem to the new
>>>>> root filesystem frees the old root filesystem of users, allowing
>>>>> it to be unmounted more easily.)
>>>>
>>>> Here is the contradiction:
>>>> The DESCRIPTION says that root and current working dir are only changed
>>>> "if they point to the old root directory". Here in the NOTES it says
>>>> that any root or working directories on the old root file system (i.e.,
>>>> even if somewhere below the root) are changed.
>>>>
>>>> Which is correct?
>>>
>>> The first text is correct. I must have accidentally inserted
>>> "filesystem" into the paragraph just here during a global edit.
>>> Thanks for catching that.
>>>
>>>> If it indeed affects all processes with root and/or current working
>>>> directory below the old root, the text here does not clearly state what
>>>> the new root/current working directory of these processes is.
>>>> E.g., if a process is at /foo and we pivot to /bar, will the process be
>>>> moved to /bar (i.e., at / after pivot_root), or will the kernel attempt
>>>> to move it to some location like /bar/foo? Because the latter might not
>>>> even exist, I suspect that everything is just moved to new_root, but
>>>> this could be stated explicitly by replacing "to the new root
>>>> filesystem" in the above paragraph with "to the new root directory"
>>>> (after checking whether this is true).
>>>
>>> The text here now reads:
>>>
>>> pivot_root() allows the caller to switch to a new root filesystem
>>> while at the same time placing the old root mount at a location
>>> under new_root from where it can subsequently be unmounted. (The
>>> fact that it moves all processes that have a root directory or
>>> current working directory on the old root directory to the new
>>> root frees the old root directory of users, allowing the old root
>>> filesystem to be unmounted more easily.)
>>
>>
>> Please "mount" instead of "filesystem".
>
> Changed.
>
>
>>>>> EXAMPLE
>>>>> The program below demonstrates the use of pivot_root() inside a
>>>>> mount namespace that is created using clone(2). After pivoting to
>>>>> the root directory named in the program's first command-line argument,
>>>>> the child created by clone(2) then executes the program
>>>>> named in the remaining command-line arguments.
>>>>
>>>> Why not use the pivot_root(".", ".") in the example program?
>>>> It would make the example shorter, and also works if the process cannot
>>>> write to new_root (e.g., in a user namespace).
>>>
>>> I'm not sure. Some people have a bit of trouble wrapping their head
>>> around the pivot_root(".", ".") idea. (I possibly am one of them.)
>>> I'd be quite keen to hear other opinions on this. Unfortunately,
>>> few people have commented on this manual page rewrite.
>>
>> I am happy as long as pivot_root(".", ".") is documented
>> somewhere. There is real code that uses it so it is not going away.
>> Plus pivot_root(".", ".") is really what is desired in a lot of
>> situations where the caller of pivot_root is an intermediary and
>> does not control the new root filesystem. At which point the only
>> path that is guaranteed to exist on the new root filesystem is "/".
>
> Good. There is documentation of pivot_root(".", ".") in the page!

Yeah!
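
For anyone following along, the pivot_root(".", ".") idiom looks roughly
like the sketch below. The function name pivot_root_in_place() is
hypothetical. It assumes new_root is already a mount point, that the
caller has CAP_SYS_ADMIN, and that the old root mount is not shared in a
way that would make pivot_root(2) fail. glibc provides no wrapper, so
the call goes through syscall(2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Switch to new_root without needing a separate put_old directory.
 * Returns 0 on success, or -1 with errno set by the failing call.
 */
static int pivot_root_in_place(const char *new_root)
{
    /* Make "." refer to the new root. */
    if (chdir(new_root) == -1)
        return -1;

    /* new_root and put_old are both "."; the old root mount ends up
       stacked at "/" together with the new root mount. */
    if (syscall(SYS_pivot_root, ".", ".") == -1)
        return -1;

    /* Lazily detach the old root mount, which is now reachable at ".". */
    if (umount2(".", MNT_DETACH) == -1)
        return -1;

    return 0;
}
```

Because it never needs to create or write to a directory under
new_root, this form also suits the intermediary case where the only
path known to exist on the new root is "/".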

Eric