Re: [PATCH net-next v4 00/27] io_uring zerocopy send

From: Pavel Begunkov
Date: Mon Jul 11 2022 - 08:56:39 EST

On 7/8/22 15:26, Pavel Begunkov wrote:
On 7/8/22 05:10, David Ahern wrote:
On 7/7/22 5:49 AM, Pavel Begunkov wrote:
NOTE: Not to be picked directly. After getting the necessary acks, I'll
work out merging with Jakub and Jens.

The patchset implements io_uring zerocopy send. It works with both registered
and normal buffers; mixing is allowed but not recommended. Apart from the usual
request completions, just as with MSG_ZEROCOPY, io_uring separately notifies
the userspace when buffers are freed and can be reused (see API design below),
with the notifications delivered into io_uring's Completion Queue. Those
"buffer-free" notifications are not necessarily per request; the userspace has
control over it and should explicitly attach a number of requests to a single
notification. The series also adds some internal optimisations for registered
buffers, such as removing page referencing.
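
For a feel of the flow, here is a minimal sketch of the userspace side,
written with liburing-style names (io_uring_prep_send_zc(),
IORING_CQE_F_MORE); these names are assumptions for illustration, and the
actual ABI of this series (explicitly attaching requests to notifications)
differs in details:

#include <stdbool.h>
#include <liburing.h>

/* Sketch: one zerocopy send yields the usual request completion plus a
 * separate "buffer may be reused" CQE. Names are illustrative, not this
 * series' ABI. */
static int send_zc_once(struct io_uring *ring, int sockfd,
			const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	bool more;
	int ret;

	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(ring);

	/* First CQE: the send result; IORING_CQE_F_MORE means a
	 * buffer-free notification will follow later. */
	if (io_uring_wait_cqe(ring, &cqe))
		return -1;
	ret = cqe->res;
	more = cqe->flags & IORING_CQE_F_MORE;
	io_uring_cqe_seen(ring, cqe);

	if (more) {
		/* Second CQE: only after this may buf be reused. */
		if (io_uring_wait_cqe(ring, &cqe))
			return -1;
		io_uring_cqe_seen(ring, cqe);
	}
	return ret;
}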

From the kernel networking perspective there are two main changes. The first
one is passing a ubuf_info into the network layer from io_uring (inside the
in-kernel struct msghdr). This allows extra optimisations, e.g. ubuf_info
caching on the io_uring side, and also helps to avoid cross-referencing
and synchronisation problems. The second is an optional optimisation
removing page referencing for requests with registered buffers.
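
For reference, the msghdr part boils down to the in-kernel struct growing a
ubuf_info pointer that protocols pick up instead of allocating their own; a
rough sketch (the field name follows the series, unrelated fields elided):

/* include/linux/socket.h, sketch only */
struct msghdr {
	/* ... existing fields (msg_name, msg_iter, msg_control, ...) ... */
	struct ubuf_info *msg_ubuf;	/* zerocopy context supplied by the
					 * caller (io_uring); when NULL,
					 * protocols fall back to the old
					 * MSG_ZEROCOPY allocation path */
};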

Benchmarks were done with an optimised version of the selftest (see [1]),
which sends a batch of requests, waits for completions and repeats. The
"zc + flush" column posts one additional "buffer-free" notification per
request, while plain "zc" doesn't post buffer notifications at all.
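
Roughly, each measurement iteration has the following shape (a paraphrase,
not the selftest's actual code; QD is a placeholder, and the "+ flush"
handling is ABI-specific and elided):

#include <liburing.h>

/* Paraphrased loop body: batch-submit QD sends, reap all completions. */
enum { QD = 32 };

static long bench_iteration(struct io_uring *ring, int sockfd,
			    const void *buf, size_t io_size)
{
	struct io_uring_cqe *cqe;
	unsigned head, seen = 0;

	for (int i = 0; i < QD; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
		io_uring_prep_send_zc(sqe, sockfd, buf, io_size, 0, 0);
	}
	io_uring_submit_and_wait(ring, QD);

	io_uring_for_each_cqe(ring, head, cqe)
		seen++;
	io_uring_cq_advance(ring, seen);
	return QD;	/* requests completed this iteration */
}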

NIC (requests / second):
IO size | non-zc    | zc             | zc + flush
4000    | 495134    | 606420 (+22%)  | 558971 (+12%)
1500    | 551808    | 577116 (+4.5%) | 565803 (+2.5%)
1000    | 584677    | 592088 (+1.2%) | 560885 (-4%)
600     | 596292    | 598550 (+0.4%) | 555366 (-6.7%)

dummy (requests / second):
IO size | non-zc    | zc             | zc + flush
8000    | 1299916   | 2396600 (+84%) | 2224219 (+71%)
4000    | 1869230   | 2344146 (+25%) | 2170069 (+16%)
1200    | 2071617   | 2361960 (+14%) | 2203052 (+6%)
600     | 2106794   | 2381527 (+13%) | 2195295 (+4%)

Previously it also brought a massive performance speedup compared to the
msg_zerocopy tool (see [3]), which is probably not super interesting.

Can you add a comment that the above results are for UDP?

Oh, right, forgot to add it

You dropped comments about TCP testing; any progress there? If not, can
you relay any issues you are hitting?

Not really a problem, but for me it's bottlenecked at NIC bandwidth
(~3GB/s) for both zc and non-zc, without coming close to saturating a CPU.
It was actually benchmarked by a colleague quite a while ago, but I can't
find the numbers. I probably need to at least add localhost numbers or grab
a better server.

The results below are for localhost TCP with a hack (see the diff at the
end), which relaxes skb_orphan_frags_rx() so that looped-back zerocopy skbs
aren't forcibly copied on the receive side. They don't include the
refcounting optimisations I was testing UDP with; those will be sent
afterwards. Numbers are in MB/s:

IO size | non-zc | zc
1200    | 4174   | 4148
4096    | 7597   | 11228

Because it's localhost, we also spend cycles here on the recv side. Using a
real NIC with 1200-byte payloads, zc is ~5-10% worse than non-zc; maybe the
omitted optimisations will help somewhat. I don't consider it a blocker,
but it would be interesting to poke into later. One thing helping non-zc is
that copying squeezes a number of requests into a single page, whereas
zerocopy adds a new frag for every request (e.g. with 4KB pages, three
1200-byte sends can share one page on the copy path, while zerocopy pins a
separate user frag per send).

Can't say anything new for larger payloads: I'm still NIC-bound, but
looking at CPU utilisation, zc doesn't burn as many cycles as non-zc.
Also, I don't remember if I mentioned it before, but another catch is that
with TCP it expects users not to flush notifications too often, because
each flush forces it to allocate a new skb instead of appending to an
existing one, losing a good chunk of the benefit of using TCP.

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1111adefd906..c4b781b2c3b1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -3218,9 +3218,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
 /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */
 static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
 {
-	if (likely(!skb_zcopy(skb)))
-		return 0;
-	return skb_copy_ubufs(skb, gfp_mask);
+	return skb_orphan_frags(skb, gfp_mask);
 }

Pavel Begunkov