Re: [PATCH 0/3] vmsplice: make vmsplice a trivial wrapper for preadv2/pwritev2

From: Andy Lutomirski

Date: Thu Jun 04 2026 - 13:39:39 EST

On Thu, Jun 4, 2026 at 9:09 AM Willy Tarreau <w@xxxxxx> wrote:
>
> On Thu, Jun 04, 2026 at 08:53:15AM -0700, Andy Lutomirski wrote:
> > On Wed, Jun 3, 2026 at 11:32 PM Willy Tarreau <w@xxxxxx> wrote:
> > >
> > > On Mon, Jun 01, 2026 at 05:28:25PM -0700, Andrew Morton wrote:
> > > > On Mon, 1 Jun 2026 16:04:55 -0400 Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> > > >
> > > > > On Mon, 1 Jun 2026 18:33:25 +0100
> > > > > Al Viro <viro@xxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > On Mon, Jun 01, 2026 at 10:17:23AM -0700, Linus Torvalds wrote:
> > > > > >
> > > > > > > > TLDR: maybe we could get rid of "f_op->splice_read". *That* would be
> > > > > > > a big simplification.
> > > > > >
> > > > > > FUSE might be interesting - fuse_dev_splice_read() and its ilk.
> > > > > > Communications between the kernel and fuse server at least used to
> > > > > > seriously want that, so that would be one place to look for unhappy
> > > > > > userland...
> > > > > >
> > > > > > > splice-related logic in fs/fuse/dev.c is interesting; another place
> > > > > > like this is kernel/trace/, but I'm less familiar with that one.
> > > > > >
> > > > > > rostedt Cc'd (miklos already had been)
> > > > >
> > > > > Thanks for the Cc. The tracing ring buffer was specifically made to be used
> > > > > by splice and the libtracefs has a lot of code to use it as well. As
> > > > > reading the ring buffer literally swaps out the write portion with a blank
> > > > > read portion, that portion (sub-buffer) is used to be directly fed into
> > > > > splice, providing a zero-copy of the trace data from the write of the event
> > > > > to going into a file.
> > > > >
> > > > > trace-cmd defaults to using splice to copy the tracing ring buffer directly
> > > > > into files to avoid as much copying during live recordings as possible.
> > > > >
> > > > > Whatever changes we make, I would like to make sure there's no regressions
> > > > > in performance of trace-cmd record.
> > > >
> > > > Well yes, the patchset seems sensible from a quality POV. But to make
> > > > a decision we should first have a decent understanding of its downside
> > > > impact.
> > > >
> > > > I haven't seen a description of that impact in the discussion thus far.
> > > > And that description is owed, please.
> > > >
> > > > I assume a small number of specialized applications are using
> > > > vmsplice() to great effect? What are those applications? What is the
> > > > impact of this change?
> > >
> > > > Once we are armed with that information, is there some middle ground in
> > > > which we de-feature vmsplice()? Fall back to pread/pwrite in the
> > > > tricky cases and still permit vmsplicing if the application is
> > > > appropriately restrictive in its usage?
> > >
> > > I'm using vmsplice() + tee() + splice() in high-performance applications,
> > > load generators to be precise, and soon a cache. This is super convenient
> > > and extremely efficient:
> > >
> > > - vmsplice() is used to prepare a "master" pipe with data to be sent
> > > over TCP or kTLS
> > > - then for each request, we do tee() from this master pipe to per-request
> > > pipes.
> > > - the per-request pipes are those that are used to deliver the data to
> > > the socket via splice().
> > >
> > > So we effectively use vmsplice(), tee() and splice() here, and for exactly
> > > the reasons they were designed: only play with page refcount and not copy
> > > data. The code is here for the curious:
> > >
> > > https://git.haproxy.org/?p=haproxy.git;a=blob;f=src/haterm.c
> > >
> > > and its ancestor is here:
> > >
> > > https://github.com/wtarreau/httpterm/blob/master/httpterm.c
> > >
> > > It simply doubles the network bandwidth compared to not using that.
> > > (62 Gbps per core vs 31). I would seriously miss it if I couldn't use
> > > this anymore.
> > >
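
The three steps listed above can be sketched roughly as follows. This
is a minimal illustration, not code taken from haterm or httpterm:
serve_pattern() and its parameters are hypothetical names, the pattern
is assumed to fit in the default pipe capacity, and the buffer is
assumed never to change while any pipe still references it.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: load a constant pattern into a "master" pipe
 * once, then deliver `nreq` copies of it to `out_fd` using tee() into
 * a per-request pipe followed by splice().
 * Returns the total number of bytes delivered, or -1 on error.
 */
ssize_t serve_pattern(int out_fd, const char *pattern, size_t len, int nreq)
{
    int master[2], req[2];
    ssize_t total = -1;

    if (pipe(master) < 0)
        return -1;
    if (pipe(req) < 0) {
        close(master[0]);
        close(master[1]);
        return -1;
    }

    /* Step 1: reference the user pages from the master pipe. */
    struct iovec iov = { .iov_base = (void *)pattern, .iov_len = len };
    if (vmsplice(master[1], &iov, 1, 0) != (ssize_t)len)
        goto out;

    total = 0;
    for (int i = 0; i < nreq; i++) {
        /* Step 2: duplicate the pipe contents; master is not consumed. */
        ssize_t n = tee(master[0], req[1], len, 0);
        if (n != (ssize_t)len) {
            total = -1;
            goto out;
        }
        /* Step 3: drain the per-request pipe into the destination. */
        while (n > 0) {
            ssize_t m = splice(req[0], NULL, out_fd, NULL, (size_t)n,
                               SPLICE_F_MOVE);
            if (m <= 0) {
                total = -1;
                goto out;
            }
            n -= m;
            total += m;
        }
    }
out:
    close(master[0]);
    close(master[1]);
    close(req[0]);
    close(req[1]);
    return total;
}
```

In the setup described above out_fd would be a TCP or kTLS socket; any
descriptor that splice() accepts as a destination should work for
experimenting.
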
> >
> > Wait a moment. This is neat, but it's literally just a benchmark,
> > right?
>
> No, it's a benchmark *tool*: it's being used to stress production code,
> which is important and super hard at high loads. You place it after your
> proxy and you measure the performance of the proxy (which is supposed not
> to be as capable as the testing tools otherwise the methodology revolves
> to testing the testing tools, which is not the point).
>
> > I skimmed the code, and it doesn't look like a production
> > workload, either. And you manage to get around the awfulness of the
> > vmsplice API's complete failure to tell you when it's done with a
> > buffer by ... never actually changing the contents of the buffer. Do
> > you have any idea how you would write correct code that uses vmsplice
> > for sends and then *ever* mutates the data without literally
> > munmapping (or madvise or something) the data so you can safely mutate
> > it?
>
> I'm not sure what you mean here Andy. I *do not* need to change the
> data, it's just a pre-made pattern.

What I mean is: this particular pattern seems limited for use in an
actual webserver as opposed to a load-tester.
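
To make the concern concrete, here is a small probe, offered as a
sketch rather than a definitive test. vmsplice_sees_later_writes() is
a hypothetical name. It vmsplices a page into a pipe, overwrites the
page, and then reads the pipe. The result depends on the kernel: if
the pipe still references the live page the reader would be expected
to see the overwritten bytes, while a kernel that implements
vmsplice() by copying would show the original bytes.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical probe. Returns 1 if data read from the pipe reflects a
 * write made to the buffer *after* vmsplice() returned, 0 if the
 * original bytes were read back, and -1 on error.
 */
int vmsplice_sees_later_writes(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    int p[2], result = -1;
    char got[8];

    if (pagesz <= 0)
        return -1;

    char *buf = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    if (pipe(p) < 0) {
        munmap(buf, (size_t)pagesz);
        return -1;
    }

    /* Fill the page with the "old" contents and hand it to the pipe. */
    memset(buf, 'A', (size_t)pagesz);
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof got };
    if (vmsplice(p[1], &iov, 1, 0) != (ssize_t)sizeof got)
        goto out;

    /* Nothing tells us whether the pipe is done with buf; write anyway. */
    memset(buf, 'B', (size_t)pagesz);

    if (read(p[0], got, sizeof got) != (ssize_t)sizeof got)
        goto out;
    result = (got[0] == 'B');
out:
    close(p[0]);
    close(p[1]);
    munmap(buf, (size_t)pagesz);
    return result;
}
```

A server that wanted to reuse buf for the next response would have no
notification telling it when that second memset() becomes safe.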

> > Or discover that we already have something better, perhaps :)
> >
> > https://man7.org/linux/man-pages/man3/io_uring_prep_send_zc.3.html
>
> io_uring is different. We tried it "the dirty way" in the past, by
> emulating a poller, and it's not worth it this way. And in order to
> do it the right way, it needs to be done totally differently, which
> has impacts all over the stack. The code in the file pointed to above
> is just for the httpterm testing feature, but the rest is much more
> complex.

I'm curious how this kludge does:

https://github.com/amluto/zc_bench

I vibe-coded this up without much care, and I don't have the hardware
needed to actually run it in an interesting manner. But, on a Linux
VM on an Apple M4, I can push about 130Gbps on a single core over
loopback. In theory this will do zerocopy sends (but not over
loopback), and I would hope that it runs *faster* than vmsplice + tee.

(I have a fancy workstation that can do a whopping 2.5Gbps. I could
probably jury-rig a test over Thunderbolt at higher speeds. I have
systems that are not available for this test right now that can do
10Gbps. But someone probably needs 40Gbps or better hardware for a
genuinely interesting test.)

--
Andy Lutomirski
AMA Capital Management, LLC