Re: [PATCH V4 0/2] ublk: add io_uring based userspace block driver

From: Xiaoguang Wang
Date: Tue Jul 12 2022 - 11:20:45 EST


>>> If we adopt to pass one io_uring fd per queue when starting device,
>>> blk-mq's queue_rq() will get corresponding io_uring file for this queue and
>>> use it to generate cqes directly to notify new io commands inserted,
>>> then UBLK_CMD_START_DEV doesn't need to wait, and can be in the
>>> same thread with ublksrv_queue_init or ublksrv_process_io.
>>> Seems that for demo_null.c, they are still in different threads.
>>> For current io_uring implementation, one sqe may generate one or more
>>> cqes, but indeed, we can generate cqes without submitting sqes, just
>>> fill one event to io_uring ctx.
>>> Just suggestions :)
> I don't have any interest in so unusual usage of io_uring, especially it
> needs fundamental change in io_uring.
Got it, I see your point now and will respect that.

> Compared with kernel side change, removing waiting until queue setup in
> userspace side simplifies nothing.
> You still need io_kiocb/io_uring_cmd and both are allocated in
> submission code path, since 'io_ring_ctx' is private for
> io_uring.
> And ublk server still needs one command to send io result to ublk driver,
> that said this way saves nothing for us cause ublk driver uses single
> command for both fetching io request and committing io result.
> Easy to say than done, you may try to write a patch to verify your
> ideas, especially no one uses io_ring in this way.
I have done a hack change in io_uring:
iff --git a/fs/io_uring.c b/fs/io_uring.c
index 5ff2cdb425bc..e6696319148e 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -11958,6 +11958,21 @@ static int io_get_ext_arg(unsigned flags, const void __user *argp, size_t *argsz
        return 0;

+static u64 tst_count;
+static void io_gen_cqe_direct(struct file *file)
+       struct io_ring_ctx *ctx;
+       ctx = file->private_data;
+       printk(KERN_ERR "tst_count %llu\n", tst_count);
+       spin_lock(&ctx->completion_lock);
+       io_fill_cqe_aux(ctx, tst_count++, 0, 0);
+       io_commit_cqring(ctx);
+       spin_unlock(&ctx->completion_lock);
+       io_cqring_ev_posted(ctx);
 SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
                u32, min_complete, u32, flags, const void __user *, argp,
                size_t, argsz)
@@ -12005,6 +12020,7 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int, fd, u32, to_submit,
        if (unlikely(ctx->flags & IORING_SETUP_R_DISABLED))
                goto out;

+       io_gen_cqe_direct(f.file);
         * For SQ polling, the thread will do all submissions and completions.
         * Just return the requested submit count, and wake the thread if

And in user-space:
ret = io_uring_queue_init(QD, &ring, 0);

for (i = 0; i < 100; i++) {
    io_uring_submit_and_wait(&ring, 0);
    io_uring_wait_cqe(&ring, &cqe);
    printf("lege user_data %llu\n", cqe->user_data);
    io_uring_cqe_seen(&ring, cqe);

Now user-space app will get cqes continually, by it does not submit any sqes.
Indeed, io_fill_cqe_aux() has been used somewhere in kernel io_uring, for example,
sent one cqe from one io_uring instance to another instance.

I had planned to use io_gen_cqe_direct() in ublk's queue_rq to notify io requests
inserted. But now I'm fine that dropping this idea.
>> As I said before, there maybe such benefits:
>> 1. may decouple io command descriptor acquire and io command handling well.
>> At least helper like tcmulib_get_next_command maybe added easily. I'm not sure, some
>> applications based on tcmu previously may need this helper.
> Why do we need similar helper of tcmulib_get_next_command()? for what
> purpose? In current design of ublk server, it should be enough for target
> code to focus on ->handle_io_async(), ->target_io_done() in case of handling
> target io via ubq io_uring, ->handle_event() in case of handling target io in
> other contexts.
I'll have a deep think about it. Initially I tried to make libublk offer
similar programming interface to libtcmu, so apps can switch to ublk
easily, but indeed they maybe totally different things.

Sorry for noises again.
> ublk server is one io_uring application, and interfaces provided
> are based on io_uring design/interfaces. I guess TCMU isn't based on
> io_uring, so it may have original style interface. You may ask
> TCMU guys if it is possible to reuse current interface for supporting
> io_uring with expected performance.
>> 2. UBLK_CMD_START_DEV won't need to wait another thread context to submit
>> number of queue depth of sqes firstly, but I admit that it's not a big issue.
> Yeah, it simplifies nothing, compared with fundamental io_uring change
> and ublk driver side change.
>> I admit batch will be good, and syscalls userspace and kernel context switch
>> introduce overhead. But for big io requests, batch in one context is not good. In
>> the example of read requests, if io request is big, say 1MB, io_uring will do
>> commit req sequentially, which indeed mainly do memcpy work. But if users
>> can choose to issue multiple ioctls which do commit req concurrently, I think
>> user may get higher io throughput.
> If BS is big, single job could saturate devices bw easily, multiple jobs won't
> make a difference, will it? Not mention sync cost among multiple jobs.n
No, I don't consider device scenes, at least for us, our target will visit distributed
file systems, which can offer very high io throughput, far greater than normal device.
> Not mention current ublk server does support multiple io jobs.
Yeah, I see that. But for read request, commit req commands will still be done
in ubq->ubq_daemon task context, and commit req commands mainly do memcpy work.

Xiaoguang Wang

>> And in this case, user may not care userspace and kernel context switch overhead at all.
>> Or to put it another way, should libublk offer synchronous programming interface ?
> It depends what the sync interface is.
> You can call sync read()/write() for target io directly in ->handdle_io_async(),
> or call them in other pthread(s) just by telling ubq after these sync ios are done
> with the eventfd API.
> demo_null.c is actually one such example.
>>> With ublk, usually we handle dozens or even more than hundred of IOs in
>>> single io_uring_enter() syscall.
>>>> io workers can run concurrently. Since GET_DATA(write request)
>>>> or COMMIT_REQ(read request) mainly do memcpy work, one
>>>> io_uring instance will just do these jobs sequentially, which may
>>>> not take advantage of multi-cpu.
>>> IMO you can implement target code to handle io in other pthreads against
>>> current libublksrv design, see demo_event.c. Or if you think it is still
>>> enough, please just share with us what the problem is. Without
>>> understanding the problem, I won't try to figure out any solution or
>>> change.
>> I need to read your ublk userspace codes carefully, if I made
> One small suggestion, you may start with the two builtin examples, both
> are simple & clean, and the big one is only 300+ LOC, really complicated?
> Thanks.
> Ming