Re: [PATCH v1] io_uring: reserve word at cqring tail+4 for the user

From: Jens Axboe
Date: Tue Sep 17 2019 - 10:54:45 EST


On 9/17/19 3:13 AM, Avi Kivity wrote:
> In some applications, a thread waits for I/O events generated by
> the kernel, and also for events generated by other threads in the same
> application. Typically events from other threads are passed using
> in-memory queues that are not known to the kernel. As long as the
> thread is active, it polls for both kernel completions and
> inter-thread completions; when it is idle, it tells the other threads
> to use an I/O event to wake it up (e.g. an eventfd or a pipe) and
> then enters the kernel, waiting for such an event or an ordinary
> I/O completion.
>
> When such a thread goes idle, it typically spins for a while to
> avoid the kernel entry/exit cost in case an event is forthcoming
> shortly. While it spins, it polls both I/O completions and
> inter-thread queues.
>
> The x86 instruction pair UMONITOR/UMWAIT allows waiting for a cache
> line to be written to. This can be used with io_uring to wait for a
> wakeup without spinning (and wasting power and slowing down the other
> hyperthread). Other threads can also wake up the waiter by doing a
> safe write to the tail word (which triggers the wakeup), but safe
> writes are slow as they require an atomic instruction. To speed up
> those wakeups, reserve a word after the tail for user writes.
>
> A thread consuming an io_uring completion queue can then use the
> following sequences:
>
> - while busy:
> - pick up work from the completion queue and from other threads,
> and process it
>
> - while idle:
> - use UMONITOR/UMWAIT to wait on completions and notifications
> from other threads for a short period
> - if no work is picked up, let other threads know you will need
> a kernel wakeup, and use io_uring_enter to wait indefinitely

This is cool, I like it. A few comments:
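
Just to check I'm reading the intended usage right, the consumer's idle
path would be roughly the below -- untested sketch, not part of the patch;
_umonitor()/_umwait() are the WAITPKG intrinsics from <immintrin.h>
(needs -mwaitpkg), and head/tail/user_word stand for pointers into the
mmap'ed CQ ring and the new reserved word:

#include <immintrin.h>	/* _umonitor(), _umwait(), __rdtsc() */

/* tail and user_word sit in the same cache line per this patch */
static void idle_wait(unsigned *head, unsigned *tail, unsigned *user_word)
{
	while (*tail == *head && *user_word == 0) {
		/* arm address monitoring on the tail's cache line */
		_umonitor(tail);
		/* re-check after arming so a wakeup can't be lost */
		if (*tail != *head || *user_word != 0)
			break;
		/* sleep until a store hits the line or the deadline passes */
		_umwait(0, __rdtsc() + 100000);
	}
	/* still nothing: tell the other threads, then block in io_uring_enter() */
}

and another thread wakes it with a plain (non-atomic) store to the
reserved word, which is what the extra word buys over poking the tail
itself.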

> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index cfb48bd088e1..4bd7905cee1d 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -77,12 +77,13 @@
>
> #define IORING_MAX_ENTRIES 4096
> #define IORING_MAX_FIXED_FILES 1024
>
> struct io_uring {
> - u32 head ____cacheline_aligned_in_smp;
> - u32 tail ____cacheline_aligned_in_smp;
> + u32 head ____cacheline_aligned;
> + u32 tail ____cacheline_aligned;
> + u32 reserved_for_user; // for cq ring and UMONITOR/UMWAIT (or similar) wakeups
> };

Since we have that full cacheline, maybe name this one a bit more
appropriately, as we can add others if we need it. Not a big deal.
But definitely use /* */ style comments :-)
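
E.g. something like this (field name just a placeholder):

	struct io_uring {
		u32 head ____cacheline_aligned;
		u32 tail ____cacheline_aligned;
		/* unused by the kernel; free for userspace wakeup schemes */
		u32 user0;
	};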

> diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
> index 1e1652f25cc1..1a6a826a66f3 100644
> --- a/include/uapi/linux/io_uring.h
> +++ b/include/uapi/linux/io_uring.h
> @@ -103,10 +103,14 @@ struct io_sqring_offsets {
> */
> #define IORING_SQ_NEED_WAKEUP (1U << 0) /* needs io_uring_enter wakeup */
>
> struct io_cqring_offsets {
> __u32 head;
> + // tail is guaranteed to be aligned on a cache line, and to have the
> + // following __u32 free for user use. This allows using e.g.
> + // UMONITOR/UMWAIT to wait on both writes to head and writes from
> + // other threads to the following word.
> __u32 tail;
> __u32 ring_mask;
> __u32 ring_entries;
> __u32 overflow;
> __u32 cqes;

Ditto on the comments here.

Would be ideal if we could pair this with an example for liburing; a basic
test case would be fine. Something that shows how to use it, and verifies
that it works.
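
Something along these lines might do as a starting point -- rough and
untested, assuming liburing's ring.cq.khead/ring.cq.ktail pointers, a
WAITPKG-capable CPU, and the reserved word being the u32 right after the
CQ tail as in this patch:

#include <immintrin.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include "liburing.h"

static unsigned *user_word;

static void *waker(void *arg)
{
	usleep(10000);
	*user_word = 1;	/* plain store into the monitored cache line */
	return NULL;
}

int main(void)
{
	struct io_uring ring;
	pthread_t thr;

	if (io_uring_queue_init(8, &ring, 0) < 0)
		return 1;

	/* the word reserved for userspace sits right after the CQ tail */
	user_word = ring.cq.ktail + 1;
	*user_word = 0;

	pthread_create(&thr, NULL, waker, NULL);

	/* idle loop: monitor the tail's line, wake on CQE or user store */
	while (*ring.cq.khead == *ring.cq.ktail && *user_word == 0) {
		_umonitor(ring.cq.ktail);
		if (*ring.cq.khead != *ring.cq.ktail || *user_word != 0)
			break;
		_umwait(0, __rdtsc() + 1000000);
	}

	pthread_join(thr, NULL);
	io_uring_queue_exit(&ring);

	printf("%s\n", *user_word ? "PASS" : "FAIL");
	return *user_word ? 0 : 1;
}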

Also, this patch is against master; it should be against for-5.4/io_uring, as
it won't apply there right now.

--
Jens Axboe