Re: For review: pid_namespaces(7) man page

From: Michael Kerrisk (man-pages)
Date: Fri Mar 01 2013 - 03:50:42 EST


Hi Eric,

On Thu, Feb 28, 2013 at 4:24 PM, Eric W. Biederman
<ebiederm@xxxxxxxxxxxx> wrote:
> "Michael Kerrisk (man-pages)" <mtk.manpages@xxxxxxxxx> writes:

[...]

>> ==========
>> PID_NAMESPACES(7) Linux Programmer's Manual PID_NAMESPACES(7)
>>
>> NAME
>> pid_namespaces - overview of Linux PID namespaces
>>
>> DESCRIPTION
[...]

>> The namespace init process
>> The first process created in a new namespace (i.e., the process
>> created using clone(2) with the CLONE_NEWPID flag, or the first
>> child created by a process after a call to unshare(2) using the
>> CLONE_NEWPID flag) has the PID 1, and is the "init" process for
>> the namespace (see init(1)). Children that are orphaned within
>> the namespace will be reparented to this process rather than
>> init(1).
>>
>> If the "init" process of a PID namespace terminates, the kernel
>> terminates all of the processes in the namespace via a SIGKILL
>> signal. This behavior reflects the fact that the "init"
>> process is essential for the correct operation of a PID
>> namespace. In this case, a subsequent fork(2) into this PID
>> namespace (e.g., from a process that has done a setns(2) into the
>> namespace using an open file descriptor for a
>> /proc/[pid]/ns/pid file corresponding to a process that was in
>> the namespace) will fail with the error ENOMEM; it is not
>> possible to create new processes in a PID namespace whose "init"
>> process has terminated.
>
> It may be useful to mention unshare in the case of fork(2) failing just
> because that is such an easy mistake to make.
>
> unshare(CLONE_NEWPID);
> pid = fork();
> waitpid(pid,...);
> fork() -> ENOMEM

I'm lost. Why does that sequence fail? The child of fork() becomes PID
1 in the new PID namespace.
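
One possible reading of the sequence: the second fork() is issued by the
parent, after the first child (PID 1 of the new namespace) has already
exited and been reaped. At that point the namespace has lost its "init",
which is the ENOMEM case described above. Below is a minimal sketch of that
reading, assuming Linux with glibc. The names run_sequence() and
second_fork_result() are hypothetical, and the whole experiment runs in a
throwaway process so the caller can still fork() afterwards.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs the quoted sequence and never returns.
 * Exit status: 0 = second fork() succeeded, 1 = unshare() was refused
 * (for example, no CAP_SYS_ADMIN), 2 = second fork() failed with ENOMEM,
 * 3 = any other failure. */
static void run_sequence(void)
{
    if (unshare(CLONE_NEWPID) == -1)
        _exit(1);

    pid_t pid = fork();          /* child becomes PID 1 of the new namespace */
    if (pid == -1)
        _exit(3);
    if (pid == 0)
        _exit(0);                /* the namespace's "init" exits at once */
    if (waitpid(pid, NULL, 0) == -1)
        _exit(3);

    pid = fork();                /* same namespace, but its "init" is gone */
    if (pid == -1)
        _exit(errno == ENOMEM ? 2 : 3);
    if (pid == 0)
        _exit(0);
    waitpid(pid, NULL, 0);
    _exit(0);
}

/* Hypothetical wrapper: runs the sequence in a throwaway process so that
 * the caller's own ability to fork() is unaffected.  Returns the exit
 * status described above, or -1 if the wrapper itself failed. */
int second_fork_result(void)
{
    int status;
    pid_t helper = fork();

    if (helper == -1)
        return -1;
    if (helper == 0)
        run_sequence();
    if (waitpid(helper, &status, 0) == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

With sufficient privilege, second_fork_result() would be expected to return
2; without it, unshare() is refused and the result is 1.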

>> Only signals for which the "init" process has established a
>> signal handler can be sent to the "init" process by other
>> members of the PID namespace. This restriction applies even to
>> privileged processes, and prevents other members of the PID
>> namespace from accidentally killing the "init" process.
>>
>> Likewise, a process in an ancestor namespace can, subject to the
>> usual permission checks described in kill(2), send signals to
>> the "init" process of a child PID namespace only if the "init"
>> process has established a handler for that signal. (Within the
>> handler, the siginfo_t si_pid field described in sigaction(2)
>> will be zero.) SIGKILL and SIGSTOP are treated exceptionally:
>> these signals are forcibly delivered when sent from an ancestor
>> PID namespace. Neither of these signals can be caught by the
>> "init" process, and so will result in the usual actions
>> associated with those signals (respectively, terminating and stopping
>> the process).
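
The first of the two rules above (signals sent from inside the namespace)
can be sketched roughly as follows, assuming Linux with glibc and enough
privilege to create a PID namespace. The names init_fn(), on_usr1() and
init_signal_demo() are hypothetical. The "init" process installs a handler
for SIGUSR1 only, and a sibling in the same namespace then sends it SIGTERM
and SIGUSR1.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile sig_atomic_t got_usr1;

static void on_usr1(int sig)
{
    (void)sig;
    got_usr1 = 1;
}

/* Runs as PID 1 of the new namespace.  Exit status: 10 = survived SIGTERM
 * and saw SIGUSR1, 11 = survived but saw no SIGUSR1, 12 = setup failed. */
static int init_fn(void *arg)
{
    struct sigaction sa;
    sigset_t set;
    (void)arg;

    signal(SIGTERM, SIG_DFL);            /* make sure no handler is inherited */
    sa.sa_handler = on_usr1;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) == -1)
        _exit(12);
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigprocmask(SIG_UNBLOCK, &set, NULL);

    pid_t sender = fork();               /* a second member of the namespace */
    if (sender == -1)
        _exit(12);
    if (sender == 0) {
        kill(1, SIGTERM);                /* no handler: expected to be dropped */
        kill(1, SIGUSR1);                /* handler present: delivered */
        _exit(0);
    }
    while (waitpid(sender, NULL, 0) == -1 && errno == EINTR)
        ;                                /* the handler may interrupt the wait */
    _exit(got_usr1 ? 10 : 11);
}

/* Returns init's exit status, -2 if init was killed by a signal, or -1 if
 * the namespace could not be created (for example, no CAP_SYS_ADMIN). */
int init_signal_demo(void)
{
    /* Assumes a downward-growing stack, as on x86 and ARM. */
    static char stack[64 * 1024] __attribute__((aligned(16)));
    int status;
    pid_t init_pid = clone(init_fn, stack + sizeof stack,
                           CLONE_NEWPID | SIGCHLD, NULL);

    if (init_pid == -1 || waitpid(init_pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -2;
}
```

If the rule holds as described, init_signal_demo() returns 10: the SIGTERM
is discarded and only the SIGUSR1 handler runs.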
>>
>> Nesting PID namespaces
>> PID namespaces can be nested: each PID namespace has a parent,
>> except for the initial ("root") PID namespace. The parent of a
>> PID namespace is the PID namespace of the process that created
>> the namespace using clone(2) or unshare(2). PID namespaces
>> thus form a tree, with all namespaces ultimately tracing their
>> ancestry to the root namespace.
>>
>> A process is visible to other processes in its PID namespace,
>> and to the processes in each direct ancestor PID namespace
>> going back to the root PID namespace. In this context,
>> "visible" means that one process can be the target of operations by
>> another process using system calls that specify a process ID.
>> Conversely, the processes in a child PID namespace can't see
>> processes in the parent and further removed ancestor namespaces.
>> More succinctly: a process can see (e.g., send signals with
>> kill(2), set nice values with setpriority(2), etc.) only
>> processes contained in its own PID namespace and in descendants of
>> that namespace.
>>
>> A process has one process ID in each of the layers of the PID
>> namespace hierarchy in which it is visible, walking back
>> through each direct ancestor namespace to the root PID
>> namespace. System calls that operate on process IDs always
>> operate using the process ID that is visible in the PID
>> namespace of the caller. A call to getpid(2) always returns the PID
>> associated with the namespace in which the process was created.
>>
>> Some processes in a PID namespace may have parents that are
>> outside of the namespace. For example, the parent of the
>> initial process in the namespace (i.e., the init(1) process with
>> PID 1) is necessarily in another namespace. Likewise, the
>> direct children of a process that uses setns(2) to cause its
>> children to join a PID namespace are in a different PID
>> namespace from the caller of setns(2). Calls to getppid(2) for such
>> processes return 0.
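
The two views of one process described above (one PID per namespace layer,
and getppid(2) returning 0 when the parent is outside the namespace) can be
observed with a small sketch like the following, assuming Linux with glibc.
The struct pid_view and the functions report_ids() and observe_pid_views()
are hypothetical names; the child reports its own view back over a pipe.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct pid_view {
    int ok;             /* 1 if the namespace was created and the child reported */
    pid_t pid_inside;   /* getpid() as seen by the child */
    pid_t ppid_inside;  /* getppid() as seen by the child */
    pid_t pid_outside;  /* the child's PID as seen by the creator */
};

/* Runs as the first process of the new PID namespace. */
static int report_ids(void *arg)
{
    int fd = *(int *)arg;
    pid_t ids[2] = { getpid(), getppid() };
    ssize_t n = write(fd, ids, sizeof ids);

    _exit(n == (ssize_t)sizeof ids ? 0 : 1);
}

struct pid_view observe_pid_views(void)
{
    /* Assumes a downward-growing stack, as on x86 and ARM. */
    static char stack[64 * 1024] __attribute__((aligned(16)));
    struct pid_view v = { 0, -1, -1, -1 };
    pid_t ids[2];
    int pfd[2];

    if (pipe(pfd) == -1)
        return v;
    v.pid_outside = clone(report_ids, stack + sizeof stack,
                          CLONE_NEWPID | SIGCHLD, &pfd[1]);
    close(pfd[1]);                       /* the child keeps its own copy */
    if (v.pid_outside != -1) {
        if (read(pfd[0], ids, sizeof ids) == (ssize_t)sizeof ids) {
            v.ok = 1;
            v.pid_inside = ids[0];
            v.ppid_inside = ids[1];
        }
        waitpid(v.pid_outside, NULL, 0);
    }
    close(pfd[0]);
    return v;
}
```

When the call succeeds, pid_inside should be 1 and ppid_inside 0, while
pid_outside is an ordinary PID in the creator's namespace. If clone() is
refused (for example, no CAP_SYS_ADMIN), ok stays 0.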
>>
>> setns(2) and unshare(2) semantics
>> Calls to setns(2) that specify a PID namespace file descriptor
>> and calls to unshare(2) with the CLONE_NEWPID flag cause
>> children subsequently created by the caller to be placed in a
>> different PID namespace from the caller. These calls do not,
>> however, change the PID namespace of the calling process, because
>> doing so would change the caller's idea of its own PID (as
>> reported by getpid()), which would break many applications and
>> libraries.
>>
>> To put things another way: a process's PID namespace membership
>> is determined when the process is created and cannot be changed
>> thereafter. Among other things, this means that the parental
>> relationship between processes mirrors the parental relationship
>> between PID namespaces: the parent of a process is either in the
>> same namespace or resides in the immediate parent PID namespace.
>
> This is mostly true. With setns it is possible to have a parent
> in a pid namespace several steps up the pid namespace hierarchy.
>
>> Every thread in a process must be in the same PID namespace.
>> For this reason, the two following call sequences will fail:
>>
>> unshare(CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>>
>> setns(fd, CLONE_NEWPID);
>> clone(..., CLONE_VM, ...); /* Fails */
>>
>> Because the above unshare(2) and setns(2) calls only change the
>> PID namespace for created children, the clone(2) calls
>> necessarily put the new thread in a different PID namespace from the
>> calling thread.
>
> I don't know if it is interesting but these sequences also fail. But I
> suppose that is obvious? Or at least documented in the clone and
> unshare manpages.
>
> clone(..., CLONE_VM, ...);
> unshare(CLONE_NEWPID); /* Fails */
>
> clone(..., CLONE_VM, ...);
> setns(fd, CLONE_NEWPID); /* Fails */


I added these to the page.

>> Miscellaneous
>> After creating a new PID namespace, it is useful for the child
>> to change its root directory and mount a new procfs instance at
>> /proc so that tools such as ps(1) work correctly. (If a new
>> mount namespace is simultaneously created by including
>> CLONE_NEWNS in the flags argument of clone(2) or unshare(2),
>> then it isn't necessary to change the root directory: a new
>> procfs instance can be mounted directly over /proc.)
>
> Should it be documented somewhere that /proc when mounted from a pid
> namespace will use the pids of that pid namespace and /proc will only
> show processes visible in the mounting pid namespace, even if that
> mount of proc is accessed by processes in other pid namespaces?
>
> You sort of say it here by saying it is useful to mount a new copy of
> /proc, which it is. I just don't see you coming out straight and saying
> why it is. It just seems to be implied.

You're right. I should be more explicit. I will add some text detailing this.
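
A minimal sketch of what this looks like in practice, assuming Linux with
glibc and enough privilege to create PID and mount namespaces: the child
mounts a fresh procfs over /proc and checks that /proc/self now refers to
PID 1, i.e. that the new mount reports PIDs from the child's namespace. The
names check_proc() and proc_self_is_init() are hypothetical. Making the
mount tree private first is an extra precaution so the new mount does not
propagate to the parent's mount namespace.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs as PID 1 in new PID and mount namespaces.
 * Exit status: 0 = /proc/self is "1", 1 = a mount failed, 2 = mismatch. */
static int check_proc(void *arg)
{
    char buf[32];
    ssize_t n;
    (void)arg;

    /* Keep our mounts from propagating to the parent's mount namespace. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        _exit(1);
    /* A procfs instance mounted here reflects this process's PID namespace. */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        _exit(1);

    n = readlink("/proc/self", buf, sizeof buf);   /* not NUL-terminated */
    _exit(n == 1 && buf[0] == '1' ? 0 : 2);
}

/* Returns 1 if the child saw itself as PID 1 through the new /proc, 0 if it
 * saw something else, and -1 if the namespaces or mounts were unavailable. */
int proc_self_is_init(void)
{
    /* Assumes a downward-growing stack, as on x86 and ARM. */
    static char stack[64 * 1024] __attribute__((aligned(16)));
    int status;
    pid_t child = clone(check_proc, stack + sizeof stack,
                        CLONE_NEWPID | CLONE_NEWNS | SIGCHLD, NULL);

    if (child == -1 || waitpid(child, &status, 0) == -1)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) == 1)
        return -1;
    return WEXITSTATUS(status) == 0 ? 1 : 0;
}
```

Without the second mount() call, the child would still be looking at the
procfs instance mounted from the parent's PID namespace, and /proc/self
would show the PID the child has in that namespace rather than 1.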

[...]

Thanks for the comments, Eric!

Cheers,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/