Re: [PATCH] io_uring/net: don't fail linked ops when done_io > 0
From: Stefan Metzmacher
Date: Fri Feb 27 2026 - 09:04:42 EST
Hi Hannes,
On 26.02.26 at 23:03, Hannes Furmans wrote:
> When io_uring recv/send with MSG_WAITALL accumulates partial data
> through done_io and then encounters an error or EOF, req_set_fail()
> sets REQ_F_FAIL despite the CQE result being positive (done_io bytes).
> io_disarm_next() then sees REQ_F_FAIL and cancels all linked operations
> with -ECANCELED, even though the user-visible result indicates success.
>
> This manifests in two code paths:
>
> 1) Direct completion: io_recv/io_send fall through to req_set_fail()
>    when ret < min_ret, even if done_io > 0. The CQE shows done_io
>    (positive) but REQ_F_FAIL severs the link chain.
>
> 2) io-wq fallback: after APOLL_MAX_RETRY (128) poll retries, the
>    request moves to io-wq. io_recv returns IOU_RETRY from the
>    MSG_WAITALL retry path, io-wq fails the request with -EAGAIN, and
>    io_req_defer_failed -> io_sendrecv_fail overwrites cqe.res with
>    done_io but leaves REQ_F_FAIL set.
>
> Fix this by:
> - Not calling req_set_fail() when done_io > 0 in io_recv, io_recvmsg,
>   io_send, io_sendmsg, io_send_zc, io_sendmsg_zc
> - Clearing REQ_F_FAIL in io_sendrecv_fail() when done_io > 0
>
> This makes MSG_WAITALL partial completions consistent with
> non-MSG_WAITALL behavior, where positive results never sever the
> IO_LINK chain.
>
> Reproducer: MSG_WAITALL recv via IO_LINK -> write on a UNIX socketpair
> where the sender closes after partial data. The recv CQE shows positive
> bytes but the linked write gets -ECANCELED.
>
> Fixes: 0031275d119e ("io_uring: call req_set_fail_links() on short send[msg]()/recv[msg]() with MSG_WAITALL")
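For reference, the short result in the reproducer is ordinary
MSG_WAITALL behavior and can be seen without io_uring. A minimal
sketch with plain recv() on a UNIX socketpair, where the sender
closes after partial data. The helper name short_waitall_recv() is
made up for illustration and is not part of the patch:

```c
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative helper (hypothetical name): ask for `want` bytes with
 * MSG_WAITALL while the peer writes only `avail` bytes and then closes.
 * Returns what recv() reported, or -1 if the setup failed.
 */
static ssize_t short_waitall_recv(size_t want, size_t avail)
{
	int sv[2];
	char buf[256];
	char data[256] = { 0 };
	ssize_t n;

	if (want > sizeof(buf) || avail > sizeof(data))
		return -1;
	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	/* The sender delivers only part of what the receiver wants. */
	if (avail > 0 && write(sv[1], data, avail) != (ssize_t)avail) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}
	close(sv[1]); /* sender goes away after partial data */

	/* MSG_WAITALL stops waiting once the peer has closed. */
	n = recv(sv[0], buf, want, MSG_WAITALL);
	close(sv[0]);
	return n;
}
```

Asking for 16 bytes when the peer wrote 4 and closed returns 4:
a positive result, but not the amount the caller asked for.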
That's by design: if a MSG_WAITALL call fails, it means
not all of the data the caller expected arrived or was sent.
When there's a LINK after that, the linked operation likely
relies on all expected data being processed! Otherwise
the message stream can get out of sync and cause corruption.
Let's assume I want to send a message header with
IO_SEND linked with an IO_SPLICE to send the payload.
If IO_SEND returns short, the situation needs to be
recovered by the caller instead of letting the
IO_SPLICE give more data to the socket.
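To make the header/payload case concrete, here is a toy model in
plain C. It is not kernel or liburing code: struct wire stands in
for the socket's byte stream, and send_linked() and
wire_is_prefix_of() are names invented for this sketch. The
sever_on_short flag models the current behavior, where a short
first request cancels the linked one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for the bytes that reached the socket. */
struct wire {
	char buf[64];
	size_t len;
};

static void wire_put(struct wire *w, const char *data, size_t n)
{
	assert(w->len + n <= sizeof(w->buf));
	memcpy(w->buf + w->len, data, n);
	w->len += n;
}

/*
 * Model a header send linked to a payload send.
 * hdr_sent is how many header bytes the first request got out.
 * With sever_on_short set, a short header cancels the payload.
 * Returns true if the payload request ran.
 */
static bool send_linked(struct wire *w, const char *hdr, size_t hdr_sent,
			const char *payload, bool sever_on_short)
{
	size_t hdr_len = strlen(hdr);

	assert(hdr_sent <= hdr_len);
	wire_put(w, hdr, hdr_sent);
	if (hdr_sent < hdr_len && sever_on_short)
		return false; /* linked payload is cancelled */
	wire_put(w, payload, strlen(payload));
	return true;
}

/* True if the wire still holds a prefix of the intended stream. */
static bool wire_is_prefix_of(const struct wire *w, const char *intended)
{
	return w->len <= strlen(intended) &&
	       memcmp(w->buf, intended, w->len) == 0;
}
```

With the header "HDR:0005" cut off after 3 bytes, severing leaves
"HDR" on the wire, a prefix the caller can still complete. Letting
the payload run leaves "HDRhello", which the peer can no longer
parse as header plus payload.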
So the current behavior is exactly what MSG_WAITALL
gives you. If you don't want that, why are you using it
at all?
metze