[REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang

From: Alex Xu (Hello71)
Date: Wed Aug 04 2021 - 11:37:50 EST


Hi,

An issue "Jobserver hangs due to full pipe" was recently reported
against Cargo, the Rust package manager. This was diagnosed as an issue
with pipe writes hanging in certain circumstances.

Specifically, if two or more threads simultaneously write to a pipe, it
is possible for all the writers to hang despite there being significant
space available in the pipe.

I have translated the Rust example to C with some small adjustments:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

/* Each thread repeatedly takes one byte out of the pipe and puts it back. */
void *thread_start(void *arg) {
    char buf[1];
    for (int i = 0; i < 1000000; i++) {
        read(pipefd[0], buf, sizeof(buf));
        write(pipefd[1], buf, sizeof(buf));
    }
    puts("done");
    return NULL;
}

int main(void) {
    pipe(pipefd);
    printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
    /* A requested size of 0 is rounded up to the minimum, one page. */
    printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
    /* Seed the pipe with one byte per thread. */
    write(pipefd[1], "aa", 2);
    pthread_t thread1, thread2;
    pthread_create(&thread1, NULL, thread_start, NULL);
    pthread_create(&thread2, NULL, thread_start, NULL);
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}

The expected behavior of this program is to print:

init buffer: 65536
new buffer: 4096
done
done

and then exit.

On Linux 5.14-rc4, compiling this program and running it will print the
following about half the time:

init buffer: 65536
new buffer: 4096
done

and then hang. This is unexpected behavior, since the pipe is at most
two bytes full at any given time.

/proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
It looks like the pipe needs not only free space but also free slots.
At pipe.c:1306, a one-page pipe has only one slot. This led me to test
nthreads=2, which also hangs. According to git blame, the pipe_write
comment was added in a194dfe, which says, among other things:

> We just abandon the preallocated slot if we get a copy error. Future
> writes may continue it and a future read will eventually recycle it.

This matches the observed behavior: in this case, there are no readers
on the pipe, so the abandoned slot is lost.
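
To illustrate why this is surprising from user space, here is a minimal
sketch of the byte-level view of such a pipe. The helper name
pipe_byte_usage is hypothetical, not part of any API. It shrinks a fresh
pipe to its minimum size, queues a few bytes, and reports the capacity
returned by F_SETPIPE_SZ alongside the queued byte count from FIONREAD.
Slot usage is not visible through these interfaces, so this only shows
the byte accounting, not the condition that actually blocks the writer.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper (illustration only): create a pipe, shrink it to
 * the minimum size, queue nbytes, and report the capacity and the number
 * of queued bytes. Returns 0 on success, -1 on failure.
 */
int pipe_byte_usage(size_t nbytes, int *capacity, int *queued)
{
    int fds[2];
    char buf[16] = {0};
    int ret = -1;

    if (nbytes > sizeof(buf) || pipe(fds) != 0)
        return -1;

    /* A request of 0 is rounded up to the minimum size (one page). */
    *capacity = fcntl(fds[1], F_SETPIPE_SZ, 0);
    if (*capacity > 0 &&
        write(fds[1], buf, nbytes) == (ssize_t)nbytes &&
        ioctl(fds[0], FIONREAD, queued) == 0)
        ret = 0;

    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

Calling pipe_byte_usage(2, &cap, &queued) mirrors the initial state of
the reproducer: two bytes queued in a pipe whose capacity is one page.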

In my opinion (as expressed on the issue), the pipe is being misused
here. As explained in the pipe(7) manual page:

> Applications should not rely on a particular capacity: an application
> should be designed so that a reading process consumes data as soon as
> it is available, so that a writing process does not remain blocked.

Despite the misuse, I am reporting this for the following reasons:

1. I am reasonably confident that this is a regression in the kernel,
   which has a standard of making reasonable efforts to maintain
   backwards compatibility even with broken programs.

2. Even if this is not a regression, it seems like this situation could
   be handled somewhat more gracefully. In this case, we are not writing
   4095 bytes and then expecting a one-byte write to succeed; the pipe
   is actually almost entirely empty.

3. Pipe sizes dynamically shrink in Linux, so despite the fact that this
   case is unlikely to occur with two or more slots available, even a
   program which does not explicitly allocate a one-page pipe buffer may
   wind up with one if the user has 1024 or more pipes already open.
   This significantly exacerbates the next point:

4. GNU make's jobserver uses pipes in a similar manner. By my reading of
   the paper, it is theoretically possible for N simultaneous writes to
   occur without any readers, where N is the maximum number of
   concurrent jobs permitted.

   Consider the following example with make -j2: two compile jobs are to
   be performed: one at the top level, and one in a sub-directory. The
   top-level make invokes one make and one cc, costing two tokens. The
   sub-make invokes one cc with its free token. The pipe is now empty.
   Now, suppose the two compilers return at exactly the same time. Both
   copies of make will attempt to simultaneously write a token to the
   pipe. This does not yet trigger deadlock: at least one write will
   always succeed on an empty pipe. Suppose the sub-make's write goes
   through. It then exits. The top-level make, however, is still blocked
   on its original write, since it was not successfully merged with the
   other write. The build is now deadlocked.

   I think the only reason this does not happen in practice is a
   coincidental design decision: when the sub-make exits, the top-level
   make receives a SIGCHLD. GNU make registers a SA_RESTART handler for
   SIGCHLD, so the write will be interrupted and restarted. This is only
   a coincidence, however: the program does not actually expect writing
   to the control pipe to ever block; it could just as well de-register
   the signal handler while performing the write and still be fully
   correct.
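
For reference, a handler of the kind described in point 4 can be sketched
as follows. This is not GNU make's actual code; the names
install_restarting_sigchld_handler, on_sigchld and child_exited are
hypothetical and only illustrate registering a SIGCHLD handler with
SA_RESTART through sigaction().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Set by the handler; hypothetical flag a main loop could poll. */
static volatile sig_atomic_t child_exited;

static void on_sigchld(int signo)
{
    (void)signo;
    child_exited = 1;
}

/*
 * Install a SIGCHLD handler with SA_RESTART, so that system calls
 * interrupted by the signal are restarted rather than failing with
 * EINTR. Returns 0 on success, -1 on failure.
 */
int install_restarting_sigchld_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGCHLD, &sa, NULL);
}
```

A program that blocked or removed this handler around its token write
would lose the incidental interruption described above, which is why I
consider the current behavior a coincidence rather than a guarantee.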

Regards,
Alex.