Re: [PATCH V6 17/17] block: ublk_drv: apply io_uring FUSED_CMD for supporting zero copy

From: Bernd Schubert
Date: Mon Apr 03 2023 - 05:25:27 EST


On 3/31/23 21:55, Bernd Schubert wrote:
> On 3/30/23 13:36, Ming Lei wrote:
>> Apply io_uring fused command for supporting zero copy:
>>
>> 1) init the fused cmd buffer(io_mapped_buf) in ublk_map_io(), and
>> deinit it
>> in ublk_unmap_io(), and this buffer is immutable, so it is just fine to
>> retrieve it from concurrent fused command.
>>
>> 1) add sub-command opcode of UBLK_IO_FUSED_SUBMIT_IO for retrieving this
>> fused cmd(zero copy) buffer
>>
>> 2) call io_fused_cmd_start_secondary_req() to provide buffer to secondary
>> request and submit secondary request; meantime setup complete callback
>> via
>> this API, once secondary request is completed, the complete callback is
>> called for freeing the buffer and completing the fused command
>>
>> Also request reference is held during fused command lifetime, and this
>> way
>> guarantees that request buffer won't be freed until all inflight fused
>> commands are completed.
>>
>> userspace(only implement sqe128 fused command):
>>
>>     https://github.com/ming1/ubdsrv/tree/fused-cmd-zc-for-v6
>>
>> liburing test(only implement normal sqe fused command: two 64byte SQEs)
>>
>>     https://github.com/ming1/liburing/tree/fused_cmd_miniublk_for_v6
>>
>> Signed-off-by: Ming Lei <ming.lei@xxxxxxxxxx>
>> ---
>>   Documentation/block/ublk.rst  | 126 ++++++++++++++++++++--
>>   drivers/block/ublk_drv.c      | 192 ++++++++++++++++++++++++++++++++--
>>   include/uapi/linux/ublk_cmd.h |   6 +-
>>   3 files changed, 303 insertions(+), 21 deletions(-)
>>
>> diff --git a/Documentation/block/ublk.rst b/Documentation/block/ublk.rst
>> index 1713b2890abb..7b7aa24e9729 100644
>> --- a/Documentation/block/ublk.rst
>> +++ b/Documentation/block/ublk.rst
>> @@ -297,18 +297,126 @@ with specified IO tag in the command data:
>>     ``UBLK_IO_COMMIT_AND_FETCH_REQ`` to the server, ublkdrv needs to copy
>>     the server buffer (pages) read to the IO request pages.
>> -Future development
>> -==================
>> +- ``UBLK_IO_FUSED_SUBMIT_IO``
>> +
>> +  Used for implementing zero copy feature.
>> +
>> +  It has to been the primary command of io_uring fused command. This
>> command
>> +  submits the generic secondary IO request with io buffer provided by
>> our primary
>> +  command, and won't be completed until the secondary request is done.
>> +
>> +  The provided buffer is represented as ``io_uring_bvec_buf``, which is
>> +  actually ublk request buffer's reference, and the reference is
>> shared &
>> +  read-only, so the generic secondary request can retrieve any part
>> of the buffer
>> +  by passing buffer offset & length.
>>   Zero copy
>> ----------
>> +=========
>> +
>> +What is zero copy?
>> +------------------
>> +
>> +When application submits IO to ``/dev/ublkb*``, userspace
>> buffer(direct io)
>> +or page cache buffer(buffered io) or kernel buffer(meta io often) is
>> used
>> +for submitting data to ublk driver, and all kinds of these buffers are
>> +represented by bio/bvecs(ublk request buffer) finally. Before supporting
>> +zero copy, data in these buffers has to be copied to ublk server
>> userspace
>> +buffer before handling WRITE IO, or after handing READ IO, so that ublk
>> +server can handle IO for ``/dev/ublkb*`` with the copied data.
>> +
>> +The extra copy between ublk request buffer and ublk server userspace
>> buffer
>> +not only increases CPU utilization(such as pinning pages, copy data),
>> but
>> +also consumes memory bandwidth, and the cost could be very big when
>> IO size
>> +is big. It is observed that ublk-null IOPS may be increased to ~5X if
>> the
>> +extra copy can be avoided.
>> +
>> +So zero copy is very important for supporting high performance block
>> device
>> +in userspace.
>> +
>> +Technical requirements
>> +----------------------
>> +
>> +- ublk request buffer use
>> +
>> +ublk request buffer is represented by bio/bvec, which is immutable,
>> so do
>> +not try to change bvec via buffer reference; data can be read from or
>> +written to the buffer according to buffer direction, but bvec can't be
>> +changed
>> +
>> +- buffer lifetime
>> +
>> +Ublk server borrows ublk request buffer for handling ublk IO, ublk
>> request
>> +buffer reference is used. Reference can't outlive the referent
>> buffer. That
>> +means all request buffer references have to be released by ublk server
>> +before ublk driver completes this request, when request buffer ownership
>> +is transferred to upper layer(FS, application, ...).
>> +
>> +Also after ublk request is completed, any page belonging to this ublk
>> +request can not be written or read any more from ublk server since it is
>> +one block device from kernel viewpoint.
>> +
>> +- buffer direction
>> +
>> +For ublk WRITE request, ublk request buffer should only be accessed
>> as data
>> +source, and the buffer can't be written by ublk server
>> +
>> +For ublk READ request, ublk request buffer should only be accessed as
>> data
>> +destination, and the buffer can't be read by ublk server, otherwise
>> kernel
>> +data is leaked to ublk server, which can be unprivileged application.
>> +
>> +- arbitrary size sub-buffer needs to be retrieved from ublk server
>> +
>> +ublk is one generic framework for implementing block device in
>> userspace,
>> +and typical requirements include logical volume manager(mirror,
>> stripped, ...),
>> +distributed network storage, compressed target, ...
>> +
>> +ublk server needs to retrieve arbitrary size sub-buffer of ublk
>> request, and
>> +ublk server needs to submit IOs with these sub-buffer(s). That also
>> means
>> +arbitrary size sub-buffer(s) can be used to submit IO multiple times.
>> +
>> +Any sub-buffer is actually one reference of ublk request buffer, which
>> +ownership can't be transferred to upper layer if any reference is held
>> +by ublk server.
>> +
>> +Why slice isn't good for ublk zero copy
>> +---------------------------------------
>> +
>> +- spliced page from ->splice_read() can't be written
>> +
>> +ublk READ request can't be handled because spliced page can't be
>> written to, and
>> +extending splice for ublk zero copy isn't one good solution
>> [#splice_extend]_
>> +
>> +- it is very hard to meet above requirements  wrt. request buffer
>> lifetime
>> +
>> +splice/pipe focuses on page reference lifetime, but ublk zero copy
>> pays more
>> +attention to ublk request buffer lifetime. If is very inefficient to
>> respect
>> +request buffer lifetime by using all pipe buffer's ->release() which
>> requires
>> +all pipe buffers and pipe to be kept when ublk server handles IO.
>> That means
>> +one single dedicated ``pipe_inode_info`` has to be allocated runtime
>> for each
>> +provided buffer, and the pipe needs to be populated with pages in
>> ublk request
>> +buffer.
>> +
>> +
>> +io_uring fused command based zero copy
>> +--------------------------------------
>> +
>> +io_uring fused command includes one primary command(uring command)
>> and one
>> +generic secondary request. The primary command is responsible for
>> submitting
>> +secondary request with provided buffer from ublk request, and primary
>> command
>> +won't be completed until the secondary request is completed.
>> +
>> +Typical ublk IO handling includes network and FS IO, so it is usual
>> enough
>> +for io_uring net & fs to support IO with provided buffer from primary
>> command.
>> -Zero copy is a generic requirement for nbd, fuse or similar drivers. A
>> -problem [#xiaoguang]_ Xiaoguang mentioned is that pages mapped to
>> userspace
>> -can't be remapped any more in kernel with existing mm interfaces.
>> This can
>> -occurs when destining direct IO to ``/dev/ublkb*``. Also, he reported
>> that
>> -big requests (IO size >= 256 KB) may benefit a lot from zero copy.
>> +Once primary command is submitted successfully, ublk driver
>> guarantees that
>> +the ublk request buffer won't be gone away since secondary request
>> actually
>> +grabs the buffer's reference. This way also guarantees that multiple
>> +concurrent fused commands associated with same request buffer works
>> fine,
>> +as the provided buffer reference is shared & read-only.
>> +Also buffer usage direction flag is passed to primary command from
>> userspace,
>> +so ublk driver can validate if it is legal to use buffer with requested
>> +direction.
>>   References
>>   ==========
>> @@ -323,4 +431,4 @@ References
>>   .. [#stefan]
>> https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
>> -.. [#xiaoguang]
>> https://lore.kernel.org/linux-block/YoOr6jBfgVm8GvWg@stefanha-x1.localdomain/
>> +.. [#splice_extend]
>> https://lore.kernel.org/linux-block/CAHk-=wgJsi7t7YYpuo6ewXGnHz2nmj67iWR6KPGoz5TBu34mWQ@xxxxxxxxxxxxxx/
>> diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
>> index a744d3b5da91..64b0408873f6 100644
>> --- a/drivers/block/ublk_drv.c
>> +++ b/drivers/block/ublk_drv.c
>> @@ -74,10 +74,15 @@ struct ublk_rq_data {
>>        *   successfully
>>        */
>>       struct kref ref;
>> +    bool allocated_bvec;
>> +    struct io_uring_bvec_buf buf[0];
>>   };
>>   struct ublk_uring_cmd_pdu {
>> -    struct ublk_queue *ubq;
>> +    union {
>> +        struct ublk_queue *ubq;
>> +        struct request *req;
>> +    };
>>   };
>>   /*
>> @@ -565,6 +570,69 @@ static size_t ublk_copy_user_pages(const struct
>> request *req,
>>       return done;
>>   }
>> +/*
>> + * The built command buffer is immutable, so it is fine to feed it to
>> + * concurrent io_uring fused commands
>> + */
>> +static int ublk_init_zero_copy_buffer(struct request *rq)
>> +{
>> +    struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
>> +    struct io_uring_bvec_buf *imu = data->buf;
>> +    struct req_iterator rq_iter;
>> +    unsigned int nr_bvecs = 0;
>> +    struct bio_vec *bvec;
>> +    unsigned int offset;
>> +    struct bio_vec bv;
>> +
>> +    if (!ublk_rq_has_data(rq))
>> +        goto exit;
>> +
>> +    rq_for_each_bvec(bv, rq, rq_iter)
>> +        nr_bvecs++;
>> +
>> +    if (!nr_bvecs)
>> +        goto exit;
>> +
>> +    if (rq->bio != rq->biotail) {
>> +        int idx = 0;
>> +
>> +        bvec = kvmalloc_array(nr_bvecs, sizeof(struct bio_vec),
>> +                GFP_NOIO);
>> +        if (!bvec)
>> +            return -ENOMEM;
>> +
>> +        offset = 0;
>> +        rq_for_each_bvec(bv, rq, rq_iter)
>> +            bvec[idx++] = bv;
>> +        data->allocated_bvec = true;
>> +    } else {
>> +        struct bio *bio = rq->bio;
>> +
>> +        offset = bio->bi_iter.bi_bvec_done;
>> +        bvec = __bvec_iter_bvec(bio->bi_io_vec, bio->bi_iter);
>> +    }
>> +    imu->bvec = bvec;
>> +    imu->nr_bvecs = nr_bvecs;
>> +    imu->offset = offset;
>> +    imu->len = blk_rq_bytes(rq);
>> +
>> +    return 0;
>> +exit:
>> +    imu->bvec = NULL;
>> +    return 0;
>> +}
>> +
>> +static void ublk_deinit_zero_copy_buffer(struct request *rq)
>> +{
>> +    struct ublk_rq_data *data = blk_mq_rq_to_pdu(rq);
>> +    struct io_uring_bvec_buf *imu = data->buf;
>> +
>> +    if (data->allocated_bvec) {
>> +        kvfree(imu->bvec);
>> +        data->allocated_bvec = false;
>> +    }
>> +}
>> +
>>   static inline bool ublk_need_map_req(const struct request *req)
>>   {
>>       return ublk_rq_has_data(req) && req_op(req) == REQ_OP_WRITE;
>> @@ -575,11 +643,23 @@ static inline bool ublk_need_unmap_req(const
>> struct request *req)
>>       return ublk_rq_has_data(req) && req_op(req) == REQ_OP_READ;
>>   }
>> -static int ublk_map_io(const struct ublk_queue *ubq, const struct
>> request *req,
>> +static int ublk_map_io(const struct ublk_queue *ubq, struct request
>> *req,
>>           struct ublk_io *io)
>>   {
>>       const unsigned int rq_bytes = blk_rq_bytes(req);
>> +    if (ublk_support_zc(ubq)) {
>> +        int ret = ublk_init_zero_copy_buffer(req);
>> +
>> +        /*
>> +         * The only failure is -ENOMEM for allocating fused cmd
>> +         * buffer, return zero so that we can requeue this req.
>> +         */
>> +        if (unlikely(ret))
>> +            return 0;
>> +        return rq_bytes;
>> +    }
>> +
>>       /*
>>        * no zero copy, we delay copy WRITE request data into ublksrv
>>        * context and the big benefit is that pinning pages in current
>> @@ -599,11 +679,17 @@ static int ublk_map_io(const struct ublk_queue
>> *ubq, const struct request *req,
>>   }
>>   static int ublk_unmap_io(const struct ublk_queue *ubq,
>> -        const struct request *req,
>> +        struct request *req,
>>           struct ublk_io *io)
>>   {
>>       const unsigned int rq_bytes = blk_rq_bytes(req);
>> +    if (ublk_support_zc(ubq)) {
>> +        ublk_deinit_zero_copy_buffer(req);
>> +
>> +        return rq_bytes;
>> +    }
>> +
>>       if (ublk_need_unmap_req(req)) {
>>           struct iov_iter iter;
>>           struct iovec iov;
>> @@ -687,6 +773,12 @@ static inline struct ublk_uring_cmd_pdu
>> *ublk_get_uring_cmd_pdu(
>>       return (struct ublk_uring_cmd_pdu *)&ioucmd->pdu;
>>   }
>> +static inline struct ublk_uring_cmd_pdu *ublk_get_uring_fused_cmd_pdu(
>> +        struct io_uring_cmd *ioucmd)
>> +{
>> +    return (struct ublk_uring_cmd_pdu *)&ioucmd->fused.pdu;
>> +}
>> +
>>   static inline bool ubq_daemon_is_dying(struct ublk_queue *ubq)
>>   {
>>       return ubq->ubq_daemon->flags & PF_EXITING;
>> @@ -742,6 +834,7 @@ static inline void __ublk_complete_rq(struct
>> request *req)
>>       return;
>>   exit:
>> +    ublk_deinit_zero_copy_buffer(req);
>>       blk_mq_end_request(req, res);
>>   }
>> @@ -1352,6 +1445,68 @@ static inline struct request
>> *__ublk_check_and_get_req(struct ublk_device *ub,
>>       return NULL;
>>   }
>> +static void ublk_fused_cmd_done_cb(struct io_uring_cmd *cmd,
>> +        unsigned issue_flags)
>> +{
>> +    struct ublk_uring_cmd_pdu *pdu = ublk_get_uring_fused_cmd_pdu(cmd);
>> +    struct request *req = pdu->req;
>> +    struct ublk_queue *ubq = req->mq_hctx->driver_data;
>> +
>> +    ublk_put_req_ref(ubq, req);
>> +    io_uring_cmd_done(cmd, 0, 0, issue_flags);
>> +}
>> +
>> +static inline bool ublk_check_fused_buf_dir(const struct request *req,
>> +        unsigned int flags)
>> +{
>> +    flags &= IO_URING_F_FUSED_BUF;
>
> ~IO_URING_F_FUSED_BUF ?
>

Ah sorry for the noise, got confused because the caller
ublk_handle_fused_cmd checks for the flag with '&' and then
ublk_check_fused_buf_dir checks with '=' - I thought intend was to
remove the flag.