copy on write for splice() from file to pipe?

From: Stefan Metzmacher
Date: Thu Feb 09 2023 - 08:56:13 EST


Hi Linus and others,

as written in a private mail before, I'm currently trying to
make use of IORING_OP_SPLICE in order to get zero copy support
in Samba.

The most important use cases are 8 MByte reads and writes to
files, where "memcpy" (at the lowest end copy_user_enhanced_fast_string())
is the obvious performance killer.

I have a prototype that offers really great performance by
avoiding "memcpy" through splice() (issued asynchronously as IORING_OP_SPLICE).

So we have two cases:

1. network -> socket -> splice -> pipe -> splice -> file -> storage

2. storage -> file -> splice -> pipe -> splice -> socket -> network

With 1. I guess everything can work reliably: once
the pages are created/filled in the socket receive buffer,
they are used exclusively and won't be shared on
the way to the file. This means we can be sure that
the data arrives unmodified in the file(system).

But with 2. there's a problem, as the pages from the file
that are spliced into the pipe are still shared, without
copy on write, with the file(system). This means that writes to the file
after the first splice modify the content of the spliced pages!
So the content may change in an uncontrolled way before it reaches the network!
I created a little example that demonstrates the problem (see below),
it gives the following output:

open(O_TMPFILE) => ffd[3]
pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[4096]
pipe() => ret[0]
splice(count=PIPE_BUF*2,ofs=0) sret[8192]
pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[4096]
pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[4096]
read(from_pipe, count=PIPE_BUF) sret[4096]
memcmp() at ofs=0, expecting 0x00 => ret[240]
memcmp() at ofs=0, checking for 0xf0 => ret[0]
read(from_pipe, count=PIPE_BUF) sret[4096]
memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[209]
memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[0]

After reading from the pipe we get the values we have written to
the file instead of the values we had at the time of splice.

For things like web servers, which mostly serve static content, this
isn't a problem, but it is for Samba, where reads and writes may happen within
microseconds, before the content is pushed to the network.

I'm wondering if there's a possible way out of this, maybe triggered by a new
flag passed to splice.

I looked through the code and noticed the existence of IOMAP_F_SHARED.
Maybe the splice from the page cache to the pipe could set IOMAP_F_SHARED,
while incrementing the refcount and the network driver could remove it again
when the refcount reaches 1 again.

Is there any other way we could achieve something like this?

In addition and/or as an alternative, I was thinking about a flag to
preadv2() (and IORING_OP_READV) to indicate the use of something
like async_memcpy(), instead of doing the copy via the CPU.
That, in combination with IORING_OP_SENDMSG_ZC, would avoid "memcpy"
on the CPU.

Any hints, remarks and prototype patches are highly welcome.

Thanks!
metze

#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <limits.h>

int main(void)
{
	int ffd;
	int pfds[2];
	char buf [PIPE_BUF] = {0, };
	char buf2 [PIPE_BUF] = {0, };
	ssize_t sret;
	int ret;
	off_t ofs;

	memset(buf, 0x1f, PIPE_BUF);

	ffd = open("/tmp/", O_RDWR | O_TMPFILE, S_IRUSR | S_IWUSR);
	printf("open(O_TMPFILE) => ffd[%d]\n", ffd);

	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0x1f sret[%zd]\n", sret);

	ret = pipe(pfds);
	printf("pipe() => ret[%d]\n", ret);

	ofs = 0;
	sret = splice(ffd, &ofs, pfds[1], NULL, PIPE_BUF*2, 0);
	printf("splice(count=PIPE_BUF*2,ofs=0) sret[%zd]\n", sret);

	memset(buf, 0xf0, PIPE_BUF);

	sret = pwrite(ffd, buf, PIPE_BUF, 0);
	printf("pwrite(count=PIPE_BUF,ofs=0) 0xf0 sret[%zd]\n", sret);
	sret = pwrite(ffd, buf, PIPE_BUF, PIPE_BUF);
	printf("pwrite(count=PIPE_BUF,ofs=PIPE_BUF) 0xf0 sret[%zd]\n", sret);

	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

	memset(buf2, 0x00, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, expecting 0x00 => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=0, checking for 0xf0 => ret[%d]\n", ret);

	sret = read(pfds[0], buf, PIPE_BUF);
	printf("read(from_pipe, count=PIPE_BUF) sret[%zd]\n", sret);

	memset(buf2, 0x1f, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, expecting 0x1f => ret[%d]\n", ret);
	memset(buf2, 0xf0, PIPE_BUF);
	ret = memcmp(buf, buf2, PIPE_BUF);
	printf("memcmp() at ofs=PIPE_BUF, checking for 0xf0 => ret[%d]\n", ret);
	return 0;
}