Re: copy on write for splice() from file to pipe?

From: Andy Lutomirski
Date: Fri Feb 10 2023 - 12:57:39 EST


On Fri, Feb 10, 2023 at 8:34 AM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Fri, Feb 10, 2023 at 7:15 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > Frankly, I really don't like having non-immutable data in a pipe.
>
> That statement is completely nonsensical.

I know what splice() is. I'm trying to make the point that it may not
be the right API for most (all?) of its use cases, that we can maybe
do better, and that we should maybe even consider deprecating (and
simplifying at the cost of performance) splice in the moderately near
future. And I think I agree with you on most of what you're saying.

> It was literally designed to be "look, we want zero-copy networking,
> and we could do 'sendfile()' by mmap'ing the file, but mmap - and
> particularly munmap - is too expensive, so we map things into kernel
> buffers instead".

Indeed. mmap() + sendfile() + munmap() is extraordinarily expensive
and is not the right solution to zero-copy networking.

>
> So saying "I really don't like having non-immutable data in a pipe" is
> complete nonsense. It's syntactically correct English, but it makes no
> conceptual sense.
>
> You can say "I don't like 'splice()'". That's fine. I used to think
> splice was a really cool concept, but I kind of hate it these days.
> Not liking splice() makes a ton of sense.
>
> But given splice, saying "I don't like non-immutable data" really is
> complete nonsense.

I am saying exactly what I meant. Obviously mutable data exists. I'm
saying that *putting it in a pipe* *while it's still mutable* is not
good. Which implies that I don't think splice() is good. No offense.

I am *not* saying that the mere existence of mutable data is a problem.

> That's not something specific to "splice()". It's fundamental to the
> whole *concept* of zero-copy. If you don't want copies, and the source
> file changes, then you see those changes.

Of course! A user program copying data from a file to a network
fundamentally does this:

Step 1: start the process.
Step 2: data goes out to the actual wire or a buffer on the NIC or is
otherwise in a place other than page cache, and the kernel reports
completion.

There are many ways to make this happen. Step 1 could be starting
read() and step 2 could be send() returning. Step 1 could be
sticking something in an io_uring queue and step 2 could be reporting
completion. Step 1 could be splice()ing to a pipe and step 2 could be
a splice from the pipe to a socket completing (and maybe even later
when the data actually goes out).

*Obviously* any change to the file between steps 1 and 2 may change
the data that goes out the wire.
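
For concreteness, here is a rough C sketch of the splice() variant of
that window. The function name splice_then_modify() and the temp file
path are made up for illustration, and a plain read() from the pipe
stands in for step 2 instead of a real socket. It splices a file into a
pipe, overwrites the file, and only then drains the pipe. Whether the
old or the new bytes come out depends on whether the kernel handed the
pipe a reference to the page cache page or a copy, so the sketch makes
no claim about which one you will see.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: splice a small temp file into a pipe (step 1),
 * overwrite the file, then drain the pipe (a stand-in for step 2).
 * On success, `out` holds the NUL-terminated bytes that came out of
 * the pipe and 0 is returned; on any failure, -1 is returned.
 */
int splice_then_modify(char *out, size_t outsz)
{
    static const char before[] = "AAAAAAAA";
    static const char after[]  = "BBBBBBBB";
    const size_t len = sizeof(before) - 1;
    char path[] = "/tmp/splice-demo-XXXXXX";
    int pfd[2] = { -1, -1 };
    off_t off = 0;
    int ret = -1;
    int fd;

    if (outsz < len + 1)
        return -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the open fd keeps the file alive */

    if (write(fd, before, len) != (ssize_t)len)
        goto out;
    if (pipe(pfd) < 0)
        goto out;

    /* Step 1: move the file data into the pipe. */
    if (splice(fd, &off, pfd[1], NULL, len, 0) != (ssize_t)len)
        goto out;

    /* The file changes while the data is still sitting in the pipe. */
    if (pwrite(fd, after, len, 0) != (ssize_t)len)
        goto out;

    /* Stand-in for step 2: the data finally leaves the pipe. */
    if (read(pfd[0], out, len) != (ssize_t)len)
        goto out;
    out[len] = '\0';
    ret = 0;
out:
    if (pfd[0] >= 0)
        close(pfd[0]);
    if (pfd[1] >= 0)
        close(pfd[1]);
    close(fd);
    return ret;
}
```

Calling splice_then_modify(buf, sizeof(buf)) with a 16-byte buffer and
printing buf shows which version of the file the pipe ended up
carrying on a given kernel and filesystem.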

> So the data lifetime - even just on just one side - can _easily_ be
> "multiple seconds" even when things are normal, and if you have actual
> network connectivity issues we are easily talking minutes.

True.

But splice is extra nasty: step 1 happens potentially arbitrarily long
before step 2, and the kernel doesn't even know which socket the data
is destined for in step 1. So step 1 can't usefully return
-EWOULDBLOCK, for example. And it's awkward for the kernel to report
errors, because steps 1 and 2 are so disconnected. And I'm not
convinced there's any corresponding benefit.


In any case, maybe io_uring gives an opportunity to do much better.
io_uring makes it *efficient* for largish numbers of long-running
operations to all be pending at once. Would an API like this work
better (very handwavy -- I make absolutely no promises that this is
compatible with existing users -- new opcodes might be needed):

Submit IORING_OP_SPLICE from a *file* to a socket: this tells the
kernel to kindly send data from the file in question to the network.
Writes to the file before submission will be reflected in the data
sent. Writes after submission may or may not be reflected. (This is
step 1 above.)

The operation completes (and is reported in the CQ) only after the
kernel knows that the data has been snapshotted (step 2 above). So
completion can be reported when the data is DMAed out or when it's
checksummed-and-copied or if the kernel decides to copy it for any
other reason *and* the kernel knows that it won't need to read the
data again for possible retransmission. As you said, this could
easily take minutes, but that seems maybe okay to me.

(And if Samba needs to make sure that future writes don't change the
outgoing data even two seconds later when the data has been sent but
not acked, then maybe a fancy API could be added to help, or maybe
Samba shouldn't be using zero copy IO in the first place!)

If the file is truncated or some other problem happens, the operation can fail.
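
The io_uring operation above does not exist, so I can't show it. The
closest existing call that names both the file and the socket in one
operation is sendfile(2), and the C sketch below shows roughly what the
user-code side looks like. The helper name file_to_socket(), the temp
file, and the AF_UNIX socketpair standing in for a network socket are
all assumptions for illustration. Note the difference from the
proposal: as far as I know, sendfile() returns once the data has been
queued on the socket, not once the kernel is done with the pages, so it
does not give the completion guarantee described above.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `msg` to a temp file, push it to one end
 * of a socketpair with a single sendfile() call, and receive it on the
 * other end into `out` (NUL-terminated).
 * Returns the number of bytes received, or -1 on failure.
 */
ssize_t file_to_socket(const char *msg, char *out, size_t outsz)
{
    char path[] = "/tmp/sendfile-demo-XXXXXX";
    const size_t len = strlen(msg);
    int sv[2] = { -1, -1 };
    ssize_t got = -1;
    off_t off = 0;
    int fd;

    if (len == 0 || outsz < len + 1)
        return -1;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);               /* the open fd keeps the file alive */

    if (write(fd, msg, len) != (ssize_t)len)
        goto out;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        goto out;

    /* One call names both ends: file in, socket out, no user pipe. */
    if (sendfile(sv[0], fd, &off, len) != (ssize_t)len)
        goto out;

    /* Receive on the peer end to confirm what was sent. */
    got = recv(sv[1], out, len, MSG_WAITALL);
    if (got >= 0)
        out[got] = '\0';
out:
    if (sv[0] >= 0)
        close(sv[0]);
    if (sv[1] >= 0)
        close(sv[1]);
    close(fd);
    return got;
}
```

The point is only the shape of the call: user code says "this file, to
this socket" once, and there is no intermediate pipe whose contents can
sit around with no destination attached.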


I don't know how easy or hard this is to implement, but it seems like
it would be quite pleasant to *use* from user code, it ought to be
even faster than splice-to-pipe-then-splice-to-socket (simply because
there is less bookkeeping), and it doesn't seem like any file change
tracking would be needed in the kernel.


If this works and becomes popular enough, splice-from-file-to-pipe
could *maybe* be replaced in the kernel with a plain copy.

--Andy