Re: [PATCH V4 0/2] ublk: add io_uring based userspace block driver

From: Xiaoguang Wang
Date: Mon Jul 11 2022 - 07:58:11 EST

Hello Ming,

First thanks for this great work, I think ublk will be a powerful
replacement for tcmu. I read your v3 ublk kernel and user-space
codes, and have some ublk design questions here:

1) As I said before, currently ublk still needs user-space backstore
to allocate io buffer in advance, for example, UBLK_IO_FETCH_REQ
command needs to set ublk_io->addr when starting device. Currently,
some of our internal business may create hundreds or thousands of
tcmu devices in one host, when switching to ublk, it'll need user-space
backstore to allocate lots of memory in advance, which will waste memory
in a host with swap off. Also user-space backstore may use advanced
network components, they may maintain internal memory pool, which
can be used as io buffer.

So I'd like to suggest that at least we add a new flag and keep GET_DATA
command. When used, FETCH command does not need to pass io
buffer addr, and backstore needs to send a GET_DATA command with
user addr for write request.

Support high flexibility, let's user decides what's best for them.

2) complicated ublk user-space
First forgive me  I think current ublk user-space codes looks somewhat
  1. UBLK_CMD_START_DEV and io handler worker need to be
in different task context, because UBLK_CMD_START_DEV needs
to wait the number of queue depth of sqes to be submitted in advance.

  2. mixed getting ublk command and target io handle in one io_uring instance
I'm not sure it's a good design, see ublksrv_handle_cqe(), which contains
many flag handle work and is_target_io() check, I think the data flow is not
that clear for me at least

  3. helper like tcmulib_get_next_command()
I wonder whether current ublk user-space can offer similar helper which
will return a set of io commands to backstore easily.

I'd like to suggest:
1. When starting ublk dev, pass io_uring fd for every queue, then in
blk-mq->queue_rq(), it'll generate one cqe for every coming request,
not need to issue fetch sqes command in advance, kernel codes would
simplify a bit,  UBLK_IO_FLAG_ACTIVE may be discarded. And helper
like returning a set of io command would be added easily. Note these
io_uring fd would be just used for notifying io command generated.

2. We use another io_uring fd per queue to handle GET_DATA or
COMMIT_REQ command. Indeed, if we can support synchronous ioctl
interface to do GET_DATA and COMMIT_REQ, we may make libublk
really simple.

Here I'd like to describe how we use tcmu. A main thread call
tcmulib_get_next_command() to get a set of io commands, then
it dispatches them to user-space io wokers. Take write requests as
example, io worker use ioctl(2) to get data from bios, and send
data to distributed fs, finally call ioctl(2) to commit req. Multiple
io workers can run concurrently. Since GET_DATA(write request)
or COMMIT_REQ(read request) mainly do memcpy work, one
io_uring instance will just do these jobs sequentially, which may
not take advantage of multi-cpu.

So finally, I would suggest ublk or libulk just offer basic interface:
add/start/delete dev interface, get io commands sescriptor, get io
data, commit io helper. Let's user-space target make decisions,
for example, whether to use eventfd.

Thanks for your patience

Xiaoguang Wang

> Hello Guys,
> ublk driver is one kernel driver for implementing generic userspace block
> device/driver, which delivers io request from ublk block device(/dev/ublkbN) into
> ublk server[1] which is the userspace part of ublk for communicating
> with ublk driver and handling specific io logic by its target module.
> Another thing ublk driver handles is to copy data between user space buffer
> and request/bio's pages, or take zero copy if mm is ready for support it in
> future. ublk driver doesn't handle any IO logic of the specific driver, so
> it is small/simple, and all io logics are done by the target code in ublkserver.
> The above two are main jobs done by ublk driver.
> ublk driver can help to move IO logic into userspace, in which the
> development work is easier/more effective than doing in kernel, such as,
> ublk-loop takes < 200 lines of loop specific code to get basically same
> function with kernel loop block driver, meantime the performance is
> is even better than kernel loop with same setting. ublksrv[1] provide built-in
> test for comparing both by running "make test T=loop", for example, see
> the test result running on VM which is over my lattop(root disk is
> nvme/device mapper/xfs):
> [root@ktest-36 ubdsrv]#make -s -C /root/git/ubdsrv/tests run T=loop/001 R=10
> running loop/001
> fio (ublk/loop(/root/git/ubdsrv/tests/tmp/ublk_loop_VqbMA), libaio, bs 4k, dio, hw queues:1)...
> randwrite: jobs 1, iops 32572
> randread: jobs 1, iops 143052
> rw: jobs 1, iops read 29919 write 29964
> [root@ktest-36 ubdsrv]# make test T=loop/003
> make -s -C /root/git/ubdsrv/tests run T=loop/003 R=10
> running loop/003
> fio (kernel_loop/kloop(/root/git/ubdsrv/tests/tmp/ublk_loop_ZIVnG), libaio, bs 4k, dio, hw queues:1)...
> randwrite: jobs 1, iops 27436
> randread: jobs 1, iops 95273
> rw: jobs 1, iops read 22542 write 22543
> Another example is high performance qcow2 support[2], which could be built with
> ublk framework more easily than doing it inside kernel.
> Also there are more people who express interests on userspace block driver[3],
> Gabriel Krisman Bertazi proposes this topic in lsf/mm/ebpf 2022 and mentioned
> requirement from Google. Ziyang Zhang from Alibaba said they "plan to
> replace TCMU by UBD as a new choice" because UBD can get better throughput than
> TCMU even with single queue[4], meantime UBD is simple. Also there is userspace
> storage service for providing storage to containers.
> It is io_uring based: io request is delivered to userspace via new added
> io_uring command which has been proved as very efficient for making nvme
> passthrough IO to get better IOPS than io_uring(READ/WRITE). Meantime one
> shared/mmap buffer is used for sharing io descriptor to userspace, the
> buffer is readonly for userspace, each IO just takes 24bytes so far.
> It is suggested to use io_uring in userspace(target part of ublk server)
> to handle IO request too. And it is still easy for ublkserver to support
> io handling by non-io_uring, and this work isn't done yet, but can be
> supported easily with help o eventfd.
> This way is efficient since no extra io command copy is required, no sleep
> is needed in transferring io command to userspace. Meantime the communication
> protocol is simple and efficient, one single command of
> UBD_IO_COMMIT_AND_FETCH_REQ can handle both fetching io request desc and commit
> command result in one trip. IO handling is often batched after single
> io_uring_enter() returns, both IO requests from ublk server target and
> IO commands could be handled as a whole batch.
> And the patch by patch change can be found in the following
> tree:
> ublk server repo(master branch):
> Any comments are welcome!
> Since V3:
> - address Gabriel Krisman Bertazi's comments on V3: add userspace data
> validation before handling command, remove warning, ...
> - remove UBLK_IO_COMMIT_REQ command as suggested by Zixiang and Gabriel Krisman Bertazi
> - fix one request double free when running abort
> - rewrite/cleanup ublk_copy_pages(), then this handling becomes very
> clean
> - add one command of UBLK_IO_REFETCH_REQ for allowing ublk_drv to build
> as module
> Since V2:
> - fix one big performance problem:
> - rename as ublk, as suggested by Gabriel Krisman Bertazi
> - lots of cleanup & code improvement & bugfix, see details in git
> hisotry
> Since V1:
> Remove RFC now because ublk driver codes gets lots of cleanup, enhancement and
> bug fixes since V1:
> - cleanup uapi: remove ublk specific error code, switch to linux error code,
> remove one command op, remove one field from cmd_desc
> - add monitor mechanism to handle ubq_daemon being killed, ublksrv[1]
> includes builtin tests for covering heavy IO with deleting ublk / killing
> ubq_daemon at the same time, and V2 pass all the two tests(make test T=generic),
> and the abort/stop mechanism is simple
> - fix MQ command buffer mmap bug, and now 'xfstetests -g auto' works well on
> MQ ublk-loop devices(test/scratch)
> - improve batching submission as suggested by Jens
> - improve handling for starting device, replace random wait/poll with
> completion
> - all kinds of cleanup, bug fix,..
> Ming Lei (2):
> ublk: add io_uring based userspace block driver
> ublk_drv: add UBLK_IO_REFETCH_REQ for supporting to build as module
> drivers/block/Kconfig | 6 +
> drivers/block/Makefile | 2 +
> drivers/block/ublk_drv.c | 1701 +++++++++++++++++++++++++++++++++
> include/uapi/linux/ublk_cmd.h | 173 ++++
> 4 files changed, 1882 insertions(+)
> create mode 100644 drivers/block/ublk_drv.c
> create mode 100644 include/uapi/linux/ublk_cmd.h