Re: [PATCH 1/3] io_uring: Add REQ_F_CQE_SKIP support for io_uring zerocopy

From: Pavel Begunkov
Date: Fri Apr 05 2024 - 09:01:19 EST


On 4/4/24 23:17, Oliver Crumrine wrote:
In his patch to enable zerocopy networking for io_uring, Pavel Begunkov
specifically disabled REQ_F_CQE_SKIP, as (at least from my
understanding) the userspace program wouldn't receive the
IORING_CQE_F_MORE flag in the result value.

No. IORING_CQE_F_MORE means there will be another CQE from this
request, so a single CQE without IORING_CQE_F_MORE is trivially
fine.

The problem is the semantics, because by suppressing the first
CQE you're loosing the result value. You might rely on WAITALL
as other sends and "fail" (in terms of io_uring) the request
in case of a partial send posting 2 CQEs, but that's not a great
way and it's getting userspace complicated pretty easily.

In short, it was left out for later because there is a
better way to implement it, but it should be done carefully


To fix this, instead of keeping track of how many CQEs have been
received, and subtracting notifs from that, programs can keep track of

That's a benchmark way of doing it, more realistically
it'd be more like

event_loop() {
cqe = wait_cqe();
struct req *r = (struct req *)cqe->user_data;
r->callback(r, cqe);
}

send_zc_callback(req, cqe) {
if (cqe->flags & F_MORE) {
// don't free the req
// we should wait for another CQE
...
}
}

how many SQEs they have issued, and if a CQE is returned with an error,
they can simply subtract from how many notifs they expect to receive.

The design specifically untangles those two notions, i.e. there can
be a notification even when the main CQE fails (ret<0). It's safer
this way, even though AFAIK relying on errors would be fine with
current users (TCP/UDP).


Signed-off-by: Oliver Crumrine <ozlinuxc@xxxxxxxxx>
---
io_uring/net.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 1e7665ff6ef7..822f49809b68 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -1044,9 +1044,6 @@ int io_send_zc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
if (unlikely(READ_ONCE(sqe->__pad2[0]) || READ_ONCE(sqe->addr3)))
return -EINVAL;
- /* we don't support IOSQE_CQE_SKIP_SUCCESS just yet */
- if (req->flags & REQ_F_CQE_SKIP)
- return -EINVAL;
notif = zc->notif = io_alloc_notif(ctx);
if (!notif)
@@ -1342,7 +1339,8 @@ void io_sendrecv_fail(struct io_kiocb *req)
req->cqe.res = sr->done_io;
if ((req->flags & REQ_F_NEED_CLEANUP) &&
- (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC))
+ (req->opcode == IORING_OP_SEND_ZC || req->opcode == IORING_OP_SENDMSG_ZC) &&
+ !(req->flags & REQ_F_CQE_SKIP))
req->cqe.flags |= IORING_CQE_F_MORE;
}

--
Pavel Begunkov