Re: [REGRESSION?] Simultaneous writes to a reader-less, non-full pipe can hang

From: Alex Xu (Hello71)
Date: Wed Aug 04 2021 - 15:49:31 EST

Excerpts from Linus Torvalds's message of August 4, 2021 12:31 pm:
> Your program is buggy.
> On Wed, Aug 4, 2021 at 8:37 AM Alex Xu (Hello71) <alex_y_xu@xxxxxxxx> wrote:
>> pipe(pipefd);
>> printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
>> printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
> Yeah, what did you expect this to do? You said you want a minimal
> buffer, you get a really small buffer.
> Then you try to write multiple messages to the pipe that you just said
> should have a minimum size.
> Don't do that then.
>> /proc/x/stack shows that the remaining thread is hanging at pipe.c:560.
>> It looks like not only there needs to be space in the pipe, but also
>> slots.
> Correct. The fullness of a pipe is not about whether it has the
> possibility of merging more bytes into an existing not-full slot, but
> about whether it has empty slots left.
> Part of that is simply the POSIX pipe guarantees - a write of size
> PIPE_BUF or less is guaranteed to be atomic, so it mustn't be split
> among buffers.
> So a pipe must not be "writable" unless it has space for at least that
> much (think select/poll, which don't know the size of the write).
> The fact that we might be able to reuse a partially filled buffer for
> smaller writes is simply not relevant to that issue.
> And yes, we could have different measures of "could write" for
> different writes, but we just don't have or want that complexity.
> Please don't mess with F_SETPIPE_SZ unless you have a really good
> reason to do so, and actually understand what you are doing.
> Doing a F_SETPIPE_SZ, 0 basically means "I want the mimimum pipe size
> possible". And that one accepts exactly one write at a time.
> Of course, the exact semantics are much more complicated than that
> "exactly one write". The pipe write code will optimistically merge
> writes into a previous buffer, which means that depending on the
> pattern of your writes, the exact number of bytes you can write will
> be very different.
> But that "merge writes into a previous buffer" only appends to the
> buffer - not _reuse_ it - so when each buffer is one page in size,
> what happens is that you can merge up to 4096 bytes worth of writes,
> but then after that the pipe write will want a new buffer - even if
> the old buffer is now empty because of old reads.
> That's why your test program won't block immediately: both writers
> will actually start out happily doing writes into that one buffer that
> is allocated, but at some point that buffer ends, and it wants to
> allocate a new buffer.
> But you told it not to allocate more buffers, and the old buffer is
> never completely empty because your readers never read _everythign_,
> so it will hang, waiting for you to empty the one minimal buffer it
> allocated. And that will never happen.
> There's a very real reason why we do *not* by default say "pipes can
> only ever use only one buffer".
> I don't think this is a regression, but if you have an actual
> application - not a test program - that does crazy things like this
> and used to work (I'm not sure it has ever worked, though), we can
> look into making it work again.
> That said, I suspect the way to make it work is to just say "the
> minimum pipe size is two slots" rather than change the "we want at
> least one empty slot". Exactly because of that whole "look, we must
> not consider a pipe that doesn't have a slot writable".
> Because clearly people don't understand how subtle F_SETPIPE_SZ is.
> It's not really a "byte count", even though that is how it's
> expressed.
> Linus

I agree that if this only affects programs which intentionally adjust
the pipe buffer size, then it is not a huge issue. The problem,
admittedly buried very close to the bottom of my email, is that the
kernel will silently provide one-page pipe buffers if the user has
exceeded 16384 (by default) pipe buffer pages allocated. Try this
slightly more complicated program:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static int pipefd[2];

void *thread_start(void *arg) {
char buf[1];
int i;
for (i = 0; i < 1000000; i++) {
read(pipefd[0], buf, sizeof(buf));
if (write(pipefd[1], buf, sizeof(buf)) == -1)
printf("done %d\n", i);
return NULL;

int main() {
// pretend there are a very large number of make processes running
for (int i = 0; i < 1025; i++)
write(pipefd[1], "aa", 2);
printf("%d %d\n", pipefd[0], pipefd[1]);
printf("init buffer: %d\n", fcntl(pipefd[1], F_GETPIPE_SZ));
//printf("new buffer: %d\n", fcntl(pipefd[1], F_SETPIPE_SZ, 0));
pthread_t thread1, thread2;
pthread_create(&thread1, NULL, thread_start, NULL);
pthread_create(&thread2, NULL, thread_start, NULL);
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);

You may need to raise your RLIMIT_NOFILE before running the program.

With default settings (fs.pipe-user-pages-soft = 16384) the init buffer
will be 4096, no mangling required. I think this could be probably be
solved if the kernel instead reduced over-quota pipes to two pages
instead of one page. If someone wants to set a one-page pipe buffer,
then they can suffer the consequences, but the kernel shouldn't silently
hand people that footgun.