Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy
From: Bernd Schubert
Date: Fri Mar 31 2023 - 15:55:58 EST
On 3/30/23 13:36, Ming Lei wrote:
> Apply io_uring fused command for supporting zero copy:
>
> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and deinit it
> in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
> retrieve it from concurrent fused command.
>
> 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
> fused cmd(zero copy) buffer
>
> 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
> request and submit secondary request; meantime setup complete callback via
> this API, once secondary request is completed, the complete callback is
> called for freeing the buffer and completing the fused command
>
> Also request reference is held during fused command lifetime, and this way
> guarantees that request buffer won't be freed until all inflight fused
> commands are completed.
>
> userspace(only implement sqe128 fused command):
>
> https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>
> liburing test(only implement normal sqe fused command: two 64byte SQEs)
>
> https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>
> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
> ---
> Documentation/block/ublk.rst | 126 ++++++++++++++++++++--
> drivers/block/ublk_drv.c | 192 ++++++++++++++++++++++++++++++++--
> include/uapi/linux/ublk_cmd.h | 6 +-
> 3 files changed, 303 insertions(+), 21 deletions(-)
>
> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
> index 1713b2890abb..7b7aa24e9729 100644
> --- a/Documentation/block/ublk.rst
> +++ b/Documentation/block/ublk.rst
> @@ -297,18 +297,126 @@ with specified IO tag in the command data:
> ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
> the server buffer (pages) read to the IO request pages.
>
> -Future development
> -==================
> +- ``UBLK_IO_FUSED_SUBMIT_IO``
> +
> + Used for implementing zero copy feature.
> +
> + It has to been the primary command of io_uring fused command. This command
> + submits the generic secondary IO request with io buffer provided by our primary
> + command, and won't be completed until the secondary request is done.
> +
> + The provided buffer is represented as ``io_uring_bvec_buf``, which is
> + actually ublk request buffer's reference, and the reference is shared &
> + read-only, so the generic secondary request can retrieve any part of the buffer
> + by passing buffer offset & length.
>
> Zero copy
> ----------
> +=========
> +
> +What is zero copy?
> +------------------
> +
> +When application submits IO to ``/dev/ublkb*``, userspace buffer(direct io)
> +or page cache buffer(buffered io) or kernel buffer(meta io often) is used
> +for submitting data to ublk driver, and all kinds of these buffers are
> +represented by bio/bvecs(ublk request buffer) finally. Before supporting
> +zero copy, data in these buffers has to be copied to ublk server userspace
> +buffer before handling WRITE IO, or after handing READ IO, so that ublk
> +server can handle IO for ``/dev/ublkb*`` with the copied data.
> +
> +The extra copy between ublk request buffer and ublk server userspace buffer
> +not only increases CPU utilization(such as pinning pages, copy data), but
> +also consumes memory bandwidth, and the cost could be very big when IO size
> +is big. It is observed that ublk-null IOPS may be increased to ~5X if the
> +extra copy can be avoided.
> +
> +So zero copy is very important for supporting high performance block device
> +in userspace.
> +
> +Technical requirements
> +----------------------
> +
> +- ublk request buffer use
> +
> +ublk request buffer is represented by bio/bvec, which is immutable, so do
> +not try to change bvec via buffer reference; data can be read from or
> +written to the buffer according to buffer direction, but bvec can't be
> +changed
> +
> +- buffer lifetime
> +
> +Ublk server borrows ublk request buffer for handling ublk IO, ublk request
> +buffer reference is used. Reference can't outlive the referent buffer. That
> +means all request buffer references have to be released by ublk server
> +before ublk driver completes this request, when request buffer ownership
> +is transferred to upper layer(FS, application, ...).
> +
> +Also after ublk request is completed, any page belonging to this ublk
> +request can not be written or read any more from ublk server since it is
> +one block device from kernel viewpoint.
> +
> +- buffer direction
> +
> +For ublk WRITE request, ublk request buffer should only be accessed as data
> +source, and the buffer can't be written by ublk server
> +
> +For ublk READ request, ublk request buffer should only be accessed as data
> +destination, and the buffer can't be read by ublk server, otherwise kernel
> +data is leaked to ublk server, which can be unprivileged application.
> +
> +- arbitrary size sub-buffer needs to be retrieved from ublk server
> +
> +ublk is one generic framework for implementing block device in userspace,
> +and typical requirements include logical volume manager(mirror, stripped, ...),
> +distributed network storage, compressed target, ...
> +
> +ublk server needs to retrieve arbitrary size sub-buffer of ublk request, and
> +ublk server needs to submit IOs with these sub-buffer(s). That also means
> +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
> +
> +Any sub-buffer is actually one reference of ublk request buffer, which
> +ownership can't be transferred to upper layer if any reference is held
> +by ublk server.
> +
> +Why slice isn't good for ublk zero copy
> +---------------------------------------
> +
> +- spliced page from ->splice_read() can't be written
> +
> +ublk READ request can't be handled because spliced page can't be written to, and
> +extending splice for ublk zero copy isn't one good solution [#splice_extend]_
> +
> +- it is very hard to meet above requirements wrt. request buffer lifetime
> +
> +splice/pipe focuses on page reference lifetime, but ublk zero copy pays more
> +attention to ublk request buffer lifetime. If is very inefficient to respect
> +request buffer lifetime by using all pipe buffer's ->release() which requires
> +all pipe buffers and pipe to be kept when ublk server handles IO. That means
> +one single dedicated ``pipe_inode_info`` has to be allocated runtime for each
> +provided buffer, and the pipe needs to be populated with pages in ublk request
> +buffer.
> +
> +
> +io_uring fused command based zero copy
> +--------------------------------------
> +
> +io_uring fused command includes one primary command(uring command) and one
> +generic secondary request. The primary command is responsible for submitting
> +secondary request with provided buffer from ublk request, and primary command
> +won't be completed until the secondary request is completed.
> +
> +Typical ublk IO handling includes network and FS IO, so it is usual enough
> +for io_uring net & fs to support IO with provided buffer from primary command.
>
> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to userspace
> -can't be remapped any more in kernel with existing mm interfaces. This can
> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported that
> -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
> +Once primary command is submitted successfully, ublk driver guarantees that
> +the ublk request buffer won't be gone away since secondary request actually
> +grabs the buffer's reference. This way also guarantees that multiple
> +concurrent fused commands associated with same request buffer works fine,
> +as the provided buffer reference is shared & read-only.
>
> +Also buffer usage direction flag is passed to primary command from userspace,
> +so ublk driver can validate if it is legal to use buffer with requested
> +direction.
>
> References
> ==========
> @@ -323,4 +431,4 @@ References
>
> .. [#stefan] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
>
> -.. [#xiaoguang] https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
> +.. [#splice_extend] https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/
> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
> index a744d3b5da91..64b0408873f6 100644
> --- a/drivers/block/ublk_drv.c
> +++ b/drivers/block/ublk_drv.c
> @@ -74,10 +74,15 @@ struct ublk_rq_data {
> * successfully
> */
> struct kref ref;
> + bool allocated_bvec;
> + struct io_uring_bvec_buf buf[0];
> };
>
> struct ublk_uring_cmd_pdu {
> - struct ublk_queue *ubq;
> + union {
> + struct ublk_queue *ubq;
> + struct request *req;
> + };
> };
>
> /*
> @@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct request *req,
> return done;
> }
>
> +/*
> + * The built command buffer is immutable, so it is fine to feed it to
> + * concurrent io_uring fused commands
> + */
> +static int ublk_init_zero_copy_buffer(struct request *rq)
> +{
> + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> + struct io_uring_bvec_buf *imu = data->buf;
> + struct req_iterator rq_iter;
> + unsigned int nr_bvecs = 0;
> + struct bio_vec *bvec;
> + unsigned int offset;
> + struct bio_vec bv;
> +
> + if (!ublk_rq_has_data(rq))
> + goto exit;
> +
> + rq_for_each_bvec(bv, rq, rq_iter)
> + nr_bvecs++;
> +
> + if (!nr_bvecs)
> + goto exit;
> +
> + if (rq->bio != rq->biotail) {
> + int idx = 0;
> +
> + bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
> + GFP_NOIO);
> + if (!bvec)
> + return -ENOMEM;
> +
> + offset = 0;
> + rq_for_each_bvec(bv, rq, rq_iter)
> + bvec[idx++] = bv;
> + data->allocated_bvec = true;
> + } else {
> + struct bio *bio = rq->bio;
> +
> + offset = bio->bi_iter.bi_bvec_done;
> + bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
> + }
> + imu->bvec = bvec;
> + imu->nr_bvecs = nr_bvecs;
> + imu->offset = offset;
> + imu->len = blk_rq_bytes(rq);
> +
> + return 0;
> +exit:
> + imu->bvec = NULL;
> + return 0;
> +}
> +
> +static void ublk_deinit_zero_copy_buffer(struct request *rq)
> +{
> + struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
> + struct io_uring_bvec_buf *imu = data->buf;
> +
> + if (data->allocated_bvec) {
> + kvfree(imu->bvec);
> + data->allocated_bvec = false;
> + }
> +}
> +
> static inline bool ublk_need_map_req(const struct request *req)
> {
> return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
> @@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const struct request *req)
> return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
> }
>
> -static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
> +static int ublk_map_io(const struct ublk_queue *ubq, struct request *req,
> struct ublk_io *io)
> {
> const unsigned int rq_bytes = blk_rq_bytes(req);
>
> + if (ublk_support_zc(ubq)) {
> + int ret = ublk_init_zero_copy_buffer(req);
> +
> + /*
> + * The only failure is -ENOMEM for allocating fused cmd
> + * buffer, return zero so that we can requeue this req.
> + */
> + if (unlikely(ret))
> + return 0;
> + return rq_bytes;
> + }
> +
> /*
> * no zero copy, we delay copy WRITE request data into ublksrv
> * context and the big benefit is that pinning pages in current
> @@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
> }
>
> static int ublk_unmap_io(const struct ublk_queue *ubq,
> - const struct request *req,
> + struct request *req,
> struct ublk_io *io)
> {
> const unsigned int rq_bytes = blk_rq_bytes(req);
>
> + if (ublk_support_zc(ubq)) {
> + ublk_deinit_zero_copy_buffer(req);
> +
> + return rq_bytes;
> + }
> +
> if (ublk_need_unmap_req(req)) {
> struct iov_iter iter;
> struct iovec iov;
> @@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu *ublk_get_uring_cmd_pdu(
> return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
> }
>
> +static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
> + struct io_uring_cmd *ioucmd)
> +{
> + return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
> +}
> +
> static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
> {
> return ubq->ubq_daemon->flags & PF_EXITING;
> @@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct request *req)
>
> return;
> exit:
> + ublk_deinit_zero_copy_buffer(req);
> blk_mq_end_request(req, res);
> }
>
> @@ -1352,6 +1445,68 @@ static inline struct request *__ublk_check_and_get_req(struct ublk_device *ub,
> return NULL;
> }
>
> +static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
> + unsigned issue_flags)
> +{
> + struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
> + struct request *req = pdu->req;
> + struct ublk_queue *ubq = req->mq_hctx->driver_data;
> +
> + ublk_put_req_ref(ubq, req);
> + io_uring_cmd_done(cmd, 0, 0, issue_flags);
> +}
> +
> +static inline bool ublk_check_fused_buf_dir(const struct request *req,
> + unsigned int flags)
> +{
> + flags &= IO_URING_F_FUSED_BUF;
~IO_URING_F_FUSED_BUF ?