Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

From: Stefan Metzmacher

Date: Fri Jun 05 2026 - 06:02:22 EST

Hi Linus,

Am I understanding correctly that this will completely break zerocopy
sendfile?

Very much, yes.

And it's worth making it very very clear that ABSOLUTELY NONE of the
recent big security bugs were in splice.

They were all in the networking and crypto code that just didn't deal
with shared data correctly.

So in that sense, it's a bit sad to discuss castrating splice.

But it's probably still the right thing to at least try.

I've seen very impressive benchmark numbers over the years, but
they've often smelled more like benchmarketing than actual real work.

There's also a real possibility that a lot of the sendfile / splice
advantage has little to do with zero-copy, and more to do with the
cost of mapping and maintaining buffers in user space.

If you are sending file data using plain reads and writes, it's not
just the "copy from user space to socket data structures".

There's also the cost of populating user space in the first place:
page faults for mmap made *that* historical copy avoidance basically a
fairy tale.

And not using mmap means that you have the cost of double caching in
the kernel _and_ user space etc.

So sendfile() as a concept (whether you use combinations of splice()
system calls or the sendfile system call itself) isn't necessarily
only about the zero-copy, it's really also about avoiding the user
space memory management.
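
To make the two paths being compared concrete, here is a minimal
sketch in C. It is only an illustration: the names copy_read_write(),
copy_sendfile() and make_input() are made up for this example and
do not come from the patch series or from any existing code. It
assumes Linux, and the file-to-file use of sendfile() assumes a
kernel of 2.6.33 or later; in a real server out_fd would be a socket.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/* Classic path: data is copied into a user-space buffer and back out. */
static ssize_t copy_read_write(int in_fd, int out_fd, size_t len)
{
    char buf[65536];
    size_t done = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    while (done < len) {
        size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
        ssize_t n = read(in_fd, buf, want);
        if (n < 0) { if (errno == EINTR) continue; return -1; }
        if (n == 0) break; /* EOF */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) { if (errno == EINTR) continue; return -1; }
            off += w;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* sendfile() path: the application never maps or touches the data. */
static ssize_t copy_sendfile(int in_fd, int out_fd, size_t len)
{
    size_t done = 0;

    assert(in_fd >= 0 && out_fd >= 0);
    while (done < len) {
        /* NULL offset: use and advance the file offset of in_fd. */
        ssize_t n = sendfile(out_fd, in_fd, NULL, len - done);
        if (n < 0) { if (errno == EINTR) continue; return -1; }
        if (n == 0) break; /* EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}

/* Hypothetical helper: create a file of len bytes, rewound to offset 0. */
static int make_input(const char *path, size_t len)
{
    char chunk[4096];
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

    if (fd < 0) return -1;
    memset(chunk, 'x', sizeof(chunk));
    for (size_t done = 0; done < len; ) {
        size_t want = len - done < sizeof(chunk) ? len - done : sizeof(chunk);
        ssize_t w = write(fd, chunk, want);
        if (w < 0) { close(fd); return -1; }
        done += (size_t)w;
    }
    if (lseek(fd, 0, SEEK_SET) < 0) { close(fd); return -1; }
    return fd;
}
```

The sketch measures nothing and does not settle which cost dominates.
It only shows that the first variant needs a user-space buffer plus
two copies through it, while the second one needs neither.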

I don't think so. Ok, maybe for webservers just serving tiny
html files, that's true. But for me with Samba it's really the
copy_to/from_iter() that is the major factor.

We can use io_uring with IOSQE_ASYNC in order to offload
the cpu-wasting memcpy to different cores, but it's still
wasting a lot of resources.

For the case of filesystem => socket, we can use
IORING_OP_SENDMSG_ZC and that at least removes the
copy_from_iter() in the sendmsg path, but the
IORING_OP_READV into buffers of sizes up to 8MBytes
still wastes cpu in copy_to_iter().

For the case of smbdirect with RDMA offload over 2x200GBit/s links,
only ~33GBytes/s are used and the server cpu (even when using
multiple cores) is the limit. Without the memcpy waste, ~46GBytes/s
is easily reached and the limit is just the network link.

Maybe another solution could be a version of
copy_to/from_iter() that uses async_memcpy(), but I haven't
had the time to experiment with that yet. Maybe a new flag
to preadv2/pwritev2 could control that, so that the
application can decide what's better.

But without an alternative please don't kill splice.

A lot of people are frustrated because they bought hardware
that is able to handle a lot of throughput, but
e.g. with the default of smb over tcp they get no
higher than 3.5GByte/s on a 100GBit/s link that's able
to handle ~11GBytes/s. And io_uring and splice are
a key factor to fix that.
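
To put the numbers above in relation, here is a small sketch of the
arithmetic in C. The helper names gbit_to_gbyte() and utilization()
are made up for this example. The conversion is the raw one (8 bits
per byte) and ignores protocol overhead, which is why 100GBit/s comes
out as 12.5GBytes/s rather than the ~11GBytes/s usable rate quoted
above.

```c
#include <assert.h>

/* Convert a raw link speed from GBit/s to GBytes/s (8 bits per byte). */
static double gbit_to_gbyte(double gbit_per_s)
{
    return gbit_per_s / 8.0;
}

/* Fraction of the achievable rate that is actually reached. */
static double utilization(double achieved, double achievable)
{
    assert(achievable > 0.0);
    return achieved / achievable;
}
```

With the figures from this mail, utilization(3.5, 11.0) is roughly
0.32 for smb over tcp, and utilization(33.0, 46.0) is roughly 0.72
for the smbdirect case. gbit_to_gbyte(2 * 200.0) gives 50.0, the raw
ceiling of the 2x200GBit/s setup.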

Thanks!
metze