Re: [PATCH 0/8] Various io_uring micro-optimizations (reducing lock contention)
From: Jens Axboe
Date: Sat Feb 01 2025 - 10:26:14 EST
On 1/31/25 9:13 AM, Jens Axboe wrote:
> On 1/29/25 11:01 AM, Max Kellermann wrote:
>> On Wed, Jan 29, 2025 at 6:45?PM Jens Axboe <axboe@xxxxxxxxx> wrote:
>>> Why are you combining it with epoll in the first place? It's a lot more
>>> efficient to wait on a/multiple events in io_uring_enter() rather than
>>> go back to a serialize one-event-per-notification by using epoll to wait
>>> on completions on the io_uring side.
>>
>> Yes, I wish I could do that, but that works only if everything is
>> io_uring - all or nothing. Most of the code is built around an
>> epoll-based loop and will not be ported to io_uring so quickly.
>>
>> Maybe what's missing is epoll_wait as io_uring opcode. Then I could
>> wrap it the other way. Or am I supposed to use io_uring
>> poll_add_multishot for that?
>
> Not a huge fan of adding more epoll logic to io_uring, but you are right
> this case may indeed make sense as it allows you to integrate better
> that way in existing event loops. I'll take a look.
Here's a series doing that:
https://git.kernel.dk/cgit/linux/log/?h=io_uring-epoll-wait
Could actually work pretty well - the last patch adds multishot support
as well, which means we can avoid the write lock dance for repeated
triggers of this epoll event. That should actually end up being more
efficient than regular epoll_wait(2).
Wrote a basic test cases to exercise it, and it seems to work fine for
me, but obviously not super well tested just yet. Below is the liburing
diff, just adds the helper to prepare one of these epoll wait requests.
diff --git a/src/include/liburing.h b/src/include/liburing.h
index 49b4edf437b2..a95c475496f4 100644
--- a/src/include/liburing.h
+++ b/src/include/liburing.h
@@ -729,6 +729,15 @@ IOURINGINLINE void io_uring_prep_listen(struct io_uring_sqe *sqe, int fd,
io_uring_prep_rw(IORING_OP_LISTEN, sqe, fd, 0, backlog, 0);
}
+struct epoll_event;
+IOURINGINLINE void io_uring_prep_epoll_wait(struct io_uring_sqe *sqe, int fd,
+ struct epoll_event *events,
+ int maxevents, unsigned flags)
+{
+ io_uring_prep_rw(IORING_OP_EPOLL_WAIT, sqe, fd, events, maxevents, 0);
+ sqe->epoll_flags = flags;
+}
+
IOURINGINLINE void io_uring_prep_files_update(struct io_uring_sqe *sqe,
int *fds, unsigned nr_fds,
int offset)
diff --git a/src/include/liburing/io_uring.h b/src/include/liburing/io_uring.h
index 765919883cff..bc725787ceb7 100644
--- a/src/include/liburing/io_uring.h
+++ b/src/include/liburing/io_uring.h
@@ -73,6 +73,7 @@ struct io_uring_sqe {
__u32 futex_flags;
__u32 install_fd_flags;
__u32 nop_flags;
+ __u32 epoll_flags;
};
__u64 user_data; /* data to be passed back at completion time */
/* pack this to avoid bogus arm OABI complaints */
@@ -262,6 +263,7 @@ enum io_uring_op {
IORING_OP_FTRUNCATE,
IORING_OP_BIND,
IORING_OP_LISTEN,
+ IORING_OP_EPOLL_WAIT,
/* this goes last, obviously */
IORING_OP_LAST,
@@ -388,6 +390,11 @@ enum io_uring_op {
#define IORING_ACCEPT_DONTWAIT (1U << 1)
#define IORING_ACCEPT_POLL_FIRST (1U << 2)
+/*
+ * epoll_wait flags, stored in sqe->epoll_flags
+ */
+#define IORING_EPOLL_WAIT_MULTISHOT (1U << 0)
+
/*
* IORING_OP_MSG_RING command types, stored in sqe->addr
*/
--
Jens Axboe