Re: [PATCH] splice support #2

From: Bernd Petrovitsch
Date: Fri Mar 31 2006 - 07:44:15 EST


On Thu, 2006-03-30 at 13:16 -0800, Linus Torvalds wrote:
[...]
> splice() really can handle any fd->fd move.
>
> The reason you want to have a pipe in the middle is that you have to
> handle partial moves _some_ way. And the pipe being the buffer really does
> allow that, and also handles the case of "what happens when we received
> more data than we could write".
>
> So the way to do copies is
>
> int pipefd[2];
> unsigned long copied = 0;
>
> if (pipe(pipefd) < 0)
> error
>
> for (;;) {
> int nr = splice(in, pipefd[1], MAX_INT, 0);
> if (nr <= 0)
> break;
> do {
> int ret = splice(pipefd[0], out, nr, 0);
> if (ret <= 0) {
> error: couldn't write 'nr' bytes
> break;
> }
>
> nr -= ret;
> } while (nr);
> }
> close(pipefd[0]);
> close(pipefd[1]);
>
> which may _seem_ very complicated and the extra pipe seems "useless", but
> it's (a) actually pretty efficient and (b) allows error recovery.
>
> That (b) is the really important part. I can pretty much guarantee that
> without the "useless" pipe, you simply couldn't do it.
>
> In particular, what happens when you try to connect two streaming devices,
> but the destination stops accepting data? You cannot put the received data
> "back" into the streaming source any way - so if you actually want to be
> able to handle error recovery, you _have_ to get access to the source
> buffers.
>
> Also, for signal handling, you need to have some way to keep the pipe
> around for several iterations on the sender side, while still returning to
> user space to do the signal handler.
>
> And guess what? That's exactly what you get with that "useless" pipe. For
> error handling, you can decide to throw the extra data that the recipient
> didn't want away (just close the pipe), and maybe that's going to be what
> most people do. But you could also decide to just do a "read()" on the
> pipefd, to just read the data into user space for debugging and/or error
> recovery..
>
> Similarly, for signals, the pipe _is_ that buffer that you need that is
> consistent across multiple invocations.
>
> So that pipe is not at all unnecessary, and in fact, it's critical. It may

IOW the "pipe" above is in fact just a "handle" of a kernel-internal
buffer (of one page) and the two fd's can be used to tell sys-calls to
read from or write into the buffer. But it is much faster because it
avoids real copying of the data into a user-space buffer and out of the
user-space buffer since the kernel can manipulate the underlying data
directly as a typical implementation of the above code does (i.e. use
read(2) and write(2) instead of the two splice(2) calls).
Uuuuuh, real zero-copy seems possible.

Bernd
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/