Re: vfork(2) behavior not consistent with fork(2) (was: vfork(2) fails after unshare(CLONE_NEWTIME) (was: [Bug 215769] man 2 vfork() does not document corner case when PID == 1))

From: Christian Brauner
Date: Wed Apr 06 2022 - 07:47:20 EST


On Tue, Apr 05, 2022 at 09:28:12PM +0200, Alejandro Colomar wrote:
> Hey, Christian!
>
> On 4/4/22 10:05, Christian Brauner wrote:
> > On Sat, Apr 02, 2022 at 11:15:52PM +0200, Alejandro Colomar (man-pages) wrote:
> > > [Added some kernel CCs that may know what's going on]
> [...]
> > > Maybe someone in the kernel can send some patch for the clone(2) and/or
> > > vfork(2) manual pages that explains the reason (if it's intended).
> >
> > Hey Alejandro,
> >
> > I won't be able to send a patch very soon but I can at least explain why
> > you see EINVAL. :)
>
> Don't hurry, we're not planning to release any time soon :)
>
> >
> > This is intended.
> >
> > vfork() suspends the parent process, and the child process shares the
> > same vm as the parent process. If the child process is in a time
> > namespace different from its parent's, it is not allowed to be in the
> > same thread group or to share virtual memory with the parent process.
> > That's why you see EINVAL.
>
> That makes a lot of sense to me.
>
> >
> > Note, the unshare(CLONE_NEWTIME) call will _not_ cause the calling
> > process to be moved into a different time namespace. Only the newly
> > created child process will be after a subsequent
> > fork()/vfork()/clone()/clone3()...
> >
> > The semantics are equivalent to that of CLONE_NEWPID in this regard. You
> > can see this via /proc/<pid>/ns/ where you see two entries for pid
> > namespaces and also two entries for time namespaces:
> >
> > * CLONE_NEWTIME
> >   * /proc/<pid>/ns/time               // current time namespace
> >   * /proc/<pid>/ns/time_for_children  // time namespace for the new child process
>
> Also makes sense. Michael taught me that a few weeks ago :)
>
> This also raises a doubt: will the same problem happen with
> CLONE_NEWPID, since it also moves the child into a new namespace (in this
> case a PID one)? See the test program below.

No, it won't. A pid namespace places no relevant constraints on vm usage,
whereas a time namespace does.
If a task joins a new time namespace, the kernel cleans the VVAR page
tables and refaults them with the new layout after the timens change. That
affects all tasks which use the same task->mm.

Since CLONE_THREAD implies CLONE_VM, this would affect the whole
thread-group behind its back: all threads would suddenly change timens.

No such issues exist for pid namespaces; they don't need to alter
task->mm.
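
To make that contrast concrete, a quick sketch along these lines should
show it (untested here; needs CAP_SYS_ADMIN, i.e. run it as root, and a
kernel with time namespaces, >= 5.6). The CLONE_NEWPID round works, the
CLONE_NEWTIME round fails with EINVAL; the child only calls _exit() so it
stays within what vfork() permits:

#define _GNU_SOURCE
#include <err.h>
#include <linux/sched.h>
#include <sched.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one unshare()+vfork() round in a throw-away child so that the
   two cases don't influence each other. */
static void
try_vfork_after_unshare(int nsflag, const char *name)
{
        pid_t child, grandchild;

        child = fork();
        if (child == -1)
                err(EXIT_FAILURE, "fork");
        if (child == 0) {
                if (unshare(nsflag) == -1)
                        err(EXIT_FAILURE, "unshare(%s)", name);

                /* vfork() implies CLONE_VM | CLONE_VFORK. */
                grandchild = vfork();
                if (grandchild == -1)
                        err(EXIT_FAILURE, "vfork() after unshare(%s)", name);
                if (grandchild == 0)
                        _exit(EXIT_SUCCESS);    /* only _exit()/execve() here */

                warnx("vfork() after unshare(%s) succeeded", name);
                _exit(EXIT_SUCCESS);
        }
        if (waitpid(child, NULL, 0) == -1)
                err(EXIT_FAILURE, "waitpid");
}

int
main(void)
{
        try_vfork_after_unshare(CLONE_NEWPID, "CLONE_NEWPID");   /* works */
        try_vfork_after_unshare(CLONE_NEWTIME, "CLONE_NEWTIME"); /* EINVAL */
        exit(EXIT_SUCCESS);
}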

>
> >
> > If during fork:
> >
> > parent_process->time != parent_process->time_for_children
> >
> > and either CLONE_VM or CLONE_THREAD is set you see EINVAL.
> >
> > You can thus replicate the same error via:
> >
> > unshare(CLONE_NEWTIME)
> >
> > and a
> >
> > clone() or clone3() call with CLONE_VM or CLONE_THREAD.
>
> So, to test my doubts, I wrote the following program (and also similar
> programs where only the CLONE_NEW* flag was changed: one with CLONE_NEWTIME,
> and one with CLONE_NEWNS):
>
> $ cat vfork_newpid.c
> #define _GNU_SOURCE
> #include <err.h>
> #include <errno.h>
> #include <linux/sched.h>
> #include <sched.h>
> #include <signal.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/syscall.h>
> #include <unistd.h>
>
> static char *const child_argv[] = {
>         "print_pid",
>         NULL
> };
>
> static char *const child_envp[] = {
>         NULL
> };
>
> int
> main(void)
> {
>         pid_t pid;
>
>         printf("%s: PID: %ld\n", program_invocation_short_name, (long) getpid());
>
>         if (unshare(CLONE_NEWPID) == -1)
>                 err(EXIT_FAILURE, "unshare(2)");
>         if (signal(SIGCHLD, SIG_IGN) == SIG_ERR)
>                 err(EXIT_FAILURE, "signal(2)");
>
>         pid = syscall(SYS_vfork);
>         //pid = vfork();  // This behaves differently.
>         switch (pid) {
>         case 0:
>                 execve("/home/alx/tmp/print_pid", child_argv, child_envp);
>                 err(EXIT_SUCCESS, "PID %ld exiting after execve(2)",
>                     (long) getpid());
>         case -1:
>                 err(EXIT_FAILURE, "vfork(2)");
>         default:
>                 errx(EXIT_SUCCESS, "Parent exiting after vfork(2).");
>         }
> }
>
> $ cat print_pid.c
> #include <err.h>
> #include <stdlib.h>
> #include <unistd.h>
>
> int
> main(void)
> {
>         errx(EXIT_SUCCESS, "PID %ld exiting.", (long) getpid());
> }
>
> $ cc -Wall -Wextra -Werror -o print_pid print_pid.c
> $ cc -Wall -Wextra -Werror -o vfork_newpid vfork_newpid.c
> $
> $
> $ sudo ./vfork_newpid
> vfork_newpid: PID: 8479
> vfork_newpid: PID 8479 exiting after execve(2): Success
> print_pid: PID 1 exiting.
> $
> $
> $ sudo ./vfork_newtime
> vfork_newtime: PID: 8484
> vfork_newtime: vfork(2): Invalid argument
> $
> $
> $ sudo ./vfork_newns
> vfork_newns: PID: 8486
> vfork_newns: PID 8486 exiting after execve(2): Success
> print_pid: PID 8487 exiting.
>
>
> The first thing I noted is that the behavior of vfork(2) differs
> considerably from fork(2), and that's something that's not clear from
> reading the manual page. It says that the parent process is suspended
> until the child calls execve(2); I expected that to mean that vfork(2)
> simply doesn't return in the parent until that has happened, but is
> otherwise transparent. My tests showed me I was wrong.
>
> I was going to propose an example program for the manual page when I
> decided to try something slightly different: calling vfork() instead of
> syscall(SYS_vfork). That changed the behavior to match fork(2) (i.e., the
> parent resumes after vfork(2) returns the PID of the child).
>
> Is that also intended? I couldn't find the glibc wrapper source code, so I
> don't know what glibc is doing here, but I straced the processes, and they
> all call vfork(), so the behavior should be consistent; it's quite weird.
> I'm very confused at this point.

glibc does vfork() via inline assembly massaging. There are probably
atfork handlers and a bunch of other stuff involved, so it's difficult to
do a remote diagnosis.
(And note that calling anything other than execve() or _exit() after
vfork() is basically undefined behavior.)
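
For what it's worth, the pattern vfork() is meant for looks roughly like
this (an untested sketch; it assumes /bin/echo exists). With the glibc
wrapper, and only execve()/_exit() in the child, the parent gets the
fork()-like return value you expected, i.e. the child's PID:

#define _GNU_SOURCE
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
        pid_t pid;
        char *const argv[] = { "echo", "hello from the child", NULL };
        char *const envp[] = { NULL };

        pid = vfork();          /* the glibc wrapper, not syscall(SYS_vfork) */
        if (pid == -1)
                err(EXIT_FAILURE, "vfork");
        if (pid == 0) {
                /* Only execve() or _exit() are safe in here. */
                execve("/bin/echo", argv, envp);
                _exit(127);     /* execve() failed; don't call exit()/err() */
        }

        /* The parent resumes here once the child has exec'd or exited. */
        printf("parent: child PID is %ld\n", (long) pid);
        if (waitpid(pid, NULL, 0) == -1)
                err(EXIT_FAILURE, "waitpid");
        exit(EXIT_SUCCESS);
}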

>
>
> I'm also wondering why it's okay to have processes in different PID ns share
> the same vm, but I guess that's implementation details that I don't need to
> care that much.

See earlier in the thread.
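
For reference, here's a small sketch (untested; needs root and a kernel
with time namespaces) that makes the time vs. time_for_children split
from earlier in the thread visible from userspace; the caller's own time
link stays put, only time_for_children changes:

#define _GNU_SOURCE
#include <err.h>
#include <linux/sched.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
show(const char *when)
{
        static const char *const links[] = {
                "/proc/self/ns/time",
                "/proc/self/ns/time_for_children",
        };
        char buf[64];
        ssize_t n;

        printf("%s:\n", when);
        for (size_t i = 0; i < 2; i++) {
                n = readlink(links[i], buf, sizeof(buf) - 1);
                if (n == -1)
                        err(EXIT_FAILURE, "readlink(%s)", links[i]);
                buf[n] = '\0';
                printf("  %-33s -> %s\n", links[i], buf);
        }
}

int
main(void)
{
        show("before unshare(CLONE_NEWTIME)");

        if (unshare(CLONE_NEWTIME) == -1)       /* needs CAP_SYS_ADMIN */
                err(EXIT_FAILURE, "unshare(CLONE_NEWTIME)");

        show("after unshare(CLONE_NEWTIME)");
        exit(EXIT_SUCCESS);
}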