Re: [PATCH 0/6] CLONE_FD: Task exit notification via file descriptor

From: Josh Triplett
Date: Fri Mar 13 2015 - 15:43:10 EST

On Fri, Mar 13, 2015 at 04:05:29PM +0000, David Drysdale wrote:
> On Fri, Mar 13, 2015 at 1:40 AM, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
> > This patch series introduces a new clone flag, CLONE_FD, which lets the caller
> > handle child process exit notification via a file descriptor rather than
> > SIGCHLD. CLONE_FD makes it possible for libraries to safely launch and manage
> > child processes on behalf of their caller, *without* taking over process-wide
> > SIGCHLD handling (either via signal handler or signalfd).
> Hi Josh,
> From the overall description (i.e. I haven't looked at the code yet)
> this looks very interesting. However, it seems to cover a lot of the
> same ground as the process descriptor feature that was added to FreeBSD
> in 9.x/10.x:


> I think it would ideally be nice for a userspace library developer to be
> able to do subprocess management (without SIGCHLD) in a similar way
> across both platforms, without lots of complicated autoconf shenanigans.
> So could we look at the overlap and seeing if we can come up with
> something that covers your requirements and also allows for something
> that looks like FreeBSD's process descriptors?

Agreed; however, I think it's reasonable to provide appropriate Linux
system calls, and then let glibc or libbsd or similar provide the
BSD-compatible calls on top of those. I don't think the kernel
interface needs to exactly match FreeBSD's, as long as it's a superset
of the functionality.

For example, pdfork can just call clone4 with CLONE_FD and return the
resulting file descriptor.

In my further comments below, I'll suggest ways that the FreeBSD library
calls could be implemented on top of Linux system calls.

> (I've actually got some rough patches to add process descriptor
> functionality on Linux, so I can look at how the two approaches compare
> and contrast.)
> > Note that signalfd for SIGCHLD does not suffice here, because that still
> > receives notification for all child processes, and interferes with process-wide
> > signal handling.
> >
> > The CLONE_FD file descriptor uniquely identifies a process on the system in a
> > race-free way, by holding a reference to the task_struct. In the future, we
> > may introduce APIs that support using process file descriptors instead of PIDs.
> FreeBSD has pdkill(2) and (theoretically) pdwait4(2) along these lines.
> I suspect we need either need pdkill(2) or a way to retrieve a PID from
> a process file descriptor, so that there's a way to send signals to the
> child.

The original caller of clone4 with CLONE_FD can pass CLONE_PARENT_SETTID
to get the PID.

In the future, I plan to add an fd-based equivalent of
rt_{,tg}sigqueueinfo (likely a single syscall with a flag to determine
whether to kill a process or thread) which is a superset of pdkill.
pdkill could then call that and just not pass the extra info.

A fair bit of pdwait4 could be implemented on top of read(), other than
the full rusage information (see below), and the ability to wait for
STOP/CONT (which the CLONE_FD file descriptor could support if desired,
but it'd have to be set via a flag at clone time).

I think it's a feature to use read() rather than an additional magic
system call.

> > Introducing CLONE_FD required two additional bits of yak shaving: Since clone
> > has no more usable flags (with the three currently unused flags unusable
> > because old kernels ignore them without EINVAL), also introduce a new clone4
> > system call with more flag bits and an extensible argument structure. And
> > since the magic pt_regs-based syscall argument processing for clone's tls
> > argument would otherwise prevent introducing a sane clone4 system call, fix
> > that too.
> >
> > I tested the CLONE_SETTLS changes with a thread-local storage test program (two
> > threads independently reading and writing a __thread variable), on both 32-bit
> > and 64-bit, and I observed no issues there.
> Worth preserving in tools/testing/selftests/ ?

Not really; it's just the following trivial program, which was faster to
write than to attempt to find somewhere:

#include <pthread.h>
#include <stdio.h>

__thread unsigned x = 0;

void *thread_func(void *unused)
unsigned *tx = &x;
for (; *tx < 10; (*tx)++)
printf("child: tx=%p *tx=%u\n", tx, *tx);
return NULL;

int main(void)
unsigned *tx = &x;
pthread_t thread;
pthread_create(&thread, NULL, thread_func, NULL);
for (; *tx < 10; (*tx)++)
printf("main: tx=%p *tx=%u\n", tx, *tx);
pthread_join(thread, NULL);
return 0;

(I didn't bother with error handling, because I ran it under strace.)

> > I tested clone4 and the new CLONE_FD call with several additional test
> > programs, launching either a process or thread (in the former case using
> > syscall(), in the latter case by calling clone4 via assembly and returning to
> > C), sleeping in parent and child to test the case of either exiting first, and
> > then printing the received clone4_info structure. Thiago also tested clone4
> > with CLONE_FD with a modified version of libqt's process handling, which
> > includes a test suite.
> >
> > I've also included the manpages patch at the end of this series. (Note that
> > the manpage documents the behavior of the future glibc wrapper as well as the
> > raw syscall.) Here's a formatted plain-text version of the manpage for
> > reference:
> FYI, I've added some comparisons with the FreeBSD equivalents below.


> > CLONE4(2) Linux Programmer's Manual CLONE4(2)
> >
> >
> >
> > NAME
> > clone4 - create a child process
> >
> > /* Prototype for the glibc wrapper function */
> >
> > #define _GNU_SOURCE
> > #include <sched.h>
> >
> > int clone4(uint64_t flags,
> > size_t args_size,
> > struct clone4_args *args,
> > int (*fn)(void *), void *arg);
> >
> > /* Prototype for the raw system call */
> >
> > int clone4(unsigned flags_high, unsigned flags_low,
> > unsigned long args_size,
> > struct clone4_args *args);
> >
> > struct clone4_args {
> > pid_t *ptid;
> > pid_t *ctid;
> > unsigned long stack_start;
> > unsigned long stack_size;
> > unsigned long tls;
> > };
> >
> >
> > clone4() creates a new process, similar to clone(2) and fork(2).
> > clone4() supports additional flags that clone(2) does not, and accepts
> > arguments via an extensible structure.
> >
> > args points to a clone4_args structure, and args_size must contain the
> > size of that structure, as understood by the caller. If the caller
> > passes a shorter structure than the kernel expects, the remaining
> > fields will default to 0. If the caller passes a larger structure than
> > the kernel expects (such as one from a newer kernel), clone4() will
> > return EINVAL. The clone4_args structure may gain additional fields at
> > the end in the future, and callers must only pass a size that encomâ
> > passes the number of fields they understand. If the caller passes 0
> > for args_size, args is ignored and may be NULL.
> >
> > In the clone4_args structure, ptid, ctid, stack_start, stack_size, and
> > tls have the same semantics as they do with clone(2) and clone2(2).
> >
> > In the glibc wrapper, fn and arg have the same semantics as they do
> > with clone(2). As with clone(2), the underlying system call works more
> > like fork(2), returning 0 in the child process; the glibc wrapper simâ
> > plifies thread execution by calling fn(arg) and exiting the child when
> > that function exits.
> >
> > The 64-bit flags argument (split into the 32-bit flags_high and
> > flags_low arguments in the kernel interface) accepts all the same flags
> > as clone(2), with the exception of the obsolete CLONE_PID,
> > CLONE_DETACHED, and CLONE_STOPPED. In addition, flags accepts the folâ
> > lowing flags:
> >
> >
> > Instead of returning a process ID, clone4() with the CLONE_FD
> > flag returns a file descriptor associated with the new process.
> > When the new process exits, the kernel will not send a signal to
> > the parent process, and will not keep the new process around as
> > a "zombie" process until a call to waitpid(2) or similar.
> > Instead, the file descriptor will become available for reading,
> > and the new process will be immediately reaped.
> Just to confirm: presumably a waitpid(-1,...) call that's already in
> progress won't return when one of these child processes exits?

I agree, I don't think it should. Because otherwise you'd also assume
you can waitpid() on the PID itself, and that'd be a race condition
since the process autoreaps.

> > Unlike using signalfd(2) for the SIGCHLD signal, the file
> > descriptor returned by clone4() with the CLONE_FD flag works
> > even with SIGCHLD unblocked in one or more threads of the parent
> > process, and allows the process to have different handlers for
> > different child processes, such as those created by a library,
> > without introducing race conditions around process-wide signal
> > handling.
> >
> > clone4() will never return a file descriptor in the range 0-2 to
> > the caller, to avoid ambiguity with the return of 0 in the child
> > process. Only the calling process will have the new file
> > descriptor open; the child process will not.
> FreeBSD's pdfork(2) returns a PID but also takes an int *fdp argument to
> return the file descriptor separately, which avoids the need for special
> case processing for low FD values (and means that POSIX's "lowest file
> descriptor not currently open" behaviour can be preserved if desired).

That'd be easy to implement if desired, by adding an outbound pointer to

The (very mild) reason I'd dropped the PID: with CLONE_FD and future
syscalls that use the fd as an identifier, PIDs can hopefully become
mostly unnecessary. However, I'm not that attached to changing the
return value; it'd be trivial to switch to an outbound parameter
instead, and then drop the "not 0-2".

> > Since the kernel does not send a termination signal when a child
> > process created with CLONE_FD exits, the low byte of flags does
> > not contain a signal number. Instead, the low byte of flags can
> > contain the following additional flags for use with CLONE_FD:
> >
> >
> > Set the O_CLOEXEC flag on the new open file descriptor.
> > See the description of the O_CLOEXEC flag in open(2) for
> > reasons why this may be useful.
> >
> >
> > Set the O_NONBLOCK flag on the new open file descriptor.
> > Using this flag saves extra calls to fcntl(2) to achieve
> > the same result.
> >
> >
> > clone4() with the CLONE_FD flag returns a file descriptor that
> > supports the following operations:
> >
> > read(2) (and similar)
> > When the new process exits, reading from the file
> > descriptor produces a single clonefd_info structure:
> >
> > struct clonefd_info {
> > uint32_t code; /* Signal code */
> > uint32_t status; /* Exit status or signal */
> > uint64_t utime; /* User CPU time */
> > uint64_t stime; /* System CPU time */
> > };
> Presumably there is no way to get full rusage information for the exited
> process?

I focused on the information available via SIGCHLD. Even utime and
stime are unnecessary for the primary use case of CLONE_FD, but I
included them because SIGCHLD does. I'd like to avoid sending the much
larger rusage over the file descriptor when the caller may not care.

However, given that the task_struct sticks around as long as the
CLONE_FD file descriptor does, if that information is normally still
available from a dead-but-not-waited-on process, it should be trivial to
add an operation that takes the file descriptor and returns the full
rusage, if someone needs that. I think that can be done as part of a
later patch series adding other operations for use with the file
descriptor, though.

> [FreeBSD theoretically has pdwait4(2) to do wait4-like operations on a
> process descriptor, including rusage retrieval. However, I don't think
> they actually implemented it:

That's a pretty good argument that we don't need to either, at least not

> > If the new process has not yet exited, read(2) either
> > blocks until it does, or fails with the error EAGAIN if
> > the file descriptor has been made nonblocking.
> >
> > Future kernels may extend clonefd_info by appending addiâ
> > tional fields to the end. Callers should read as many
> > bytes as they understand; unread data will be discarded,
> > and subsequent reads after the first will return 0 to
> > indicate end-of-file. Callers requesting more bytes than
> > the kernel provides (such as callers expecting a newer
> > clonefd_info structure) will receive a shorter structure
> > from older kernels.
> FreeBSD also implements fstat(2) for its process descriptors, although
> only a few of the fields get filled in.

I looked at what they provide, and that seems like more of a novelty
than something particularly useful (since most of the stat fields aren't
meaningful), but if that's useful for compatibility then adding it seems

> > poll(2), select(2), epoll(7) (and similar)
> > The file descriptor is readable (the select(2) readfds
> > argument; the poll(2) POLLIN flag) if the new process has
> > exited.
> FreeBSD uses POLLHUP here.

That makes sense given that they provide the information via a separate
call rather than read. Since the CLONE_FD file descriptor uses read, it
needs to provide POLLIN, but I have no objection to using *both* POLLIN
and POLLHUP if that'd be at all useful.

> > close(2)
> > When the file descriptor is no longer required it should
> > be closed. If no process has a file descriptor open for
> > the new process, no process will receive any notification
> > when the new process exits. The new process will still
> > be immediately reaped.
> FreeBSD has two different behaviours for close(2), depending on a flag
> value (PD_DAEMON). With the flag set it's roughly like this, but
> without PD_DAEMON a close(2) operation on the (last open) file
> descriptor terminates the child process.
> This can be quite useful, particularly for the use case where some
> userspace library has an FD-controlled subprocess -- if the application
> using the library terminates, the process descriptor is closed and so
> the subprocess is automatically terminated.

That's an interesting idea. I don't think it makes sense for that to be
the default behavior, but if someone wanted to add an additional flag
to implement that behavior, that seems fine. A FreeBSD-compatible
pdfork could then use that flag when not passed PD_DAEMON and not use it
when passed PD_DAEMON.

How does it kill the process when the last open descriptor closes?
SIGKILL? SIGTERM? The former seems unfriendly (preventing graceful
termination), and the latter blockable. There's a reason init systems
send TERM, then wait, then KILL.

- Josh Triplett
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at