Re: [PATCH V4 3/3] virtio-blk: Add bio-based IO path for virtio-blk

From: Michael S. Tsirkin
Date: Sun Jul 29 2012 - 07:11:29 EST


On Sat, Jul 28, 2012 at 10:21:05AM +0800, Asias He wrote:
> This patch introduces bio-based IO path for virtio-blk.
>
> Compared to request-based IO path, bio-based IO path uses driver
> provided ->make_request_fn() method to bypasses the IO scheduler. It
> handles the bio to device directly without allocating a request in block
> layer. This reduces the IO path in guest kernel to achieve high IOPS
> and lower latency. The downside is that guest can not use the IO
> scheduler to merge and sort requests. However, this is not a big problem
> if the backend disk in host side uses faster disk device.

If this optimization depends on the host, then it
should be reported to the guest using a feature bit,
as opposed to being guest driven.

> When the bio-based IO path is not enabled, virtio-blk still uses the
> original request-based IO path, no performance difference is observed.
>
> Performance evaluation:
> -----------------------------
> 1) Fio test is performed in a 8 vcpu guest with ramdisk based guest using
> kvm tool.
>
> Short version:
> With bio-based IO path, sequential read/write, random read/write
> IOPS boost : 28%, 24%, 21%, 16%
> Latency improvement: 32%, 17%, 21%, 16%
>
> Long version:
> With bio-based IO path:
> seq-read : io=2048.0MB, bw=116996KB/s, iops=233991 , runt= 17925msec
> seq-write : io=2048.0MB, bw=100829KB/s, iops=201658 , runt= 20799msec
> rand-read : io=3095.7MB, bw=112134KB/s, iops=224268 , runt= 28269msec
> rand-write: io=3095.7MB, bw=96198KB/s, iops=192396 , runt= 32952msec
> clat (usec): min=0 , max=2631.6K, avg=58716.99, stdev=191377.30
> clat (usec): min=0 , max=1753.2K, avg=66423.25, stdev=81774.35
> clat (usec): min=0 , max=2915.5K, avg=61685.70, stdev=120598.39
> clat (usec): min=0 , max=1933.4K, avg=76935.12, stdev=96603.45
> cpu : usr=74.08%, sys=703.84%, ctx=29661403, majf=21354, minf=22460954
> cpu : usr=70.92%, sys=702.81%, ctx=77219828, majf=13980, minf=27713137
> cpu : usr=72.23%, sys=695.37%, ctx=88081059, majf=18475, minf=28177648
> cpu : usr=69.69%, sys=654.13%, ctx=145476035, majf=15867, minf=26176375
> With request-based IO path:
> seq-read : io=2048.0MB, bw=91074KB/s, iops=182147 , runt= 23027msec
> seq-write : io=2048.0MB, bw=80725KB/s, iops=161449 , runt= 25979msec
> rand-read : io=3095.7MB, bw=92106KB/s, iops=184211 , runt= 34416msec
> rand-write: io=3095.7MB, bw=82815KB/s, iops=165630 , runt= 38277msec
> clat (usec): min=0 , max=1932.4K, avg=77824.17, stdev=170339.49
> clat (usec): min=0 , max=2510.2K, avg=78023.96, stdev=146949.15
> clat (usec): min=0 , max=3037.2K, avg=74746.53, stdev=128498.27
> clat (usec): min=0 , max=1363.4K, avg=89830.75, stdev=114279.68
> cpu : usr=53.28%, sys=724.19%, ctx=37988895, majf=17531, minf=23577622
> cpu : usr=49.03%, sys=633.20%, ctx=205935380, majf=18197, minf=27288959
> cpu : usr=55.78%, sys=722.40%, ctx=101525058, majf=19273, minf=28067082
> cpu : usr=56.55%, sys=690.83%, ctx=228205022, majf=18039, minf=26551985


So bio based causes cpu to jump up by some 30%?
What happens if you divide IOPS/CPU?
Any improvement that comes from increasing the cpu share
of the given guest on the host will not scale well on
a typical overcommitted host.

> 2) Fio test is performed in a 8 vcpu guest with Fusion-IO based guest using
> kvm tool.
>
> Short version:
> With bio-based IO path, sequential read/write, random read/write
> IOPS boost : 11%, 11%, 13%, 10%
> Latency improvement: 10%, 10%, 12%, 10%
> Long Version:
> With bio-based IO path:
> read : io=2048.0MB, bw=58920KB/s, iops=117840 , runt= 35593msec
> write: io=2048.0MB, bw=64308KB/s, iops=128616 , runt= 32611msec
> read : io=3095.7MB, bw=59633KB/s, iops=119266 , runt= 53157msec
> write: io=3095.7MB, bw=62993KB/s, iops=125985 , runt= 50322msec
> clat (usec): min=0 , max=1284.3K, avg=128109.01, stdev=71513.29
> clat (usec): min=94 , max=962339 , avg=116832.95, stdev=65836.80
> clat (usec): min=0 , max=1846.6K, avg=128509.99, stdev=89575.07
> clat (usec): min=0 , max=2256.4K, avg=121361.84, stdev=82747.25
> cpu : usr=56.79%, sys=421.70%, ctx=147335118, majf=21080, minf=19852517
> cpu : usr=61.81%, sys=455.53%, ctx=143269950, majf=16027, minf=24800604
> cpu : usr=63.10%, sys=455.38%, ctx=178373538, majf=16958, minf=24822612
> cpu : usr=62.04%, sys=453.58%, ctx=226902362, majf=16089, minf=23278105
> With request-based IO path:
> read : io=2048.0MB, bw=52896KB/s, iops=105791 , runt= 39647msec
> write: io=2048.0MB, bw=57856KB/s, iops=115711 , runt= 36248msec
> read : io=3095.7MB, bw=52387KB/s, iops=104773 , runt= 60510msec
> write: io=3095.7MB, bw=57310KB/s, iops=114619 , runt= 55312msec
> clat (usec): min=0 , max=1532.6K, avg=142085.62, stdev=109196.84
> clat (usec): min=0 , max=1487.4K, avg=129110.71, stdev=114973.64
> clat (usec): min=0 , max=1388.6K, avg=145049.22, stdev=107232.55
> clat (usec): min=0 , max=1465.9K, avg=133585.67, stdev=110322.95
> cpu : usr=44.08%, sys=590.71%, ctx=451812322, majf=14841, minf=17648641
> cpu : usr=48.73%, sys=610.78%, ctx=418953997, majf=22164, minf=26850689
> cpu : usr=45.58%, sys=581.16%, ctx=714079216, majf=21497, minf=22558223
> cpu : usr=48.40%, sys=599.65%, ctx=656089423, majf=16393, minf=23824409


Is this host or guest cpu? We should probably measure host cpu
as this includes device overhead which could vary by load.

> How to use:
> -----------------------------
> Add 'virtio_blk.use_bio=1' to kernel cmdline or 'modprobe virtio_blk
> use_bio=1' to enable ->make_request_fn() based I/O path.
>
> Cc: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
> Cc: "Michael S. Tsirkin" <mst@xxxxxxxxxx>
> Cc: virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Cc: kvm@xxxxxxxxxxxxxxx
> Cc: linux-kernel@xxxxxxxxxxxxxxx
> Acked-by: Rusty Russell <rusty@xxxxxxxxxxxxxxx>
> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
> Signed-off-by: Minchan Kim <minchan.kim@xxxxxxxxx>
> Signed-off-by: Asias He <asias@xxxxxxxxxx>
> ---
> drivers/block/virtio_blk.c | 203 +++++++++++++++++++++++++++++++++++---------
> 1 file changed, 163 insertions(+), 40 deletions(-)
>
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index c0bbeb4..95cfeed 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -14,6 +14,9 @@
>
> #define PART_BITS 4
>
> +static bool use_bio;
> +module_param(use_bio, bool, S_IRUGO);
> +
> static int major;
> static DEFINE_IDA(vd_index_ida);
>
> @@ -23,6 +26,7 @@ struct virtio_blk
> {
> struct virtio_device *vdev;
> struct virtqueue *vq;
> + wait_queue_head_t queue_wait;
>
> /* The disk structure for the kernel. */
> struct gendisk *disk;
> @@ -51,53 +55,87 @@ struct virtio_blk
> struct virtblk_req
> {
> struct request *req;
> + struct bio *bio;
> struct virtio_blk_outhdr out_hdr;
> struct virtio_scsi_inhdr in_hdr;
> u8 status;
> + struct scatterlist sg[];
> };
>
> -static void blk_done(struct virtqueue *vq)
> +static inline int virtblk_result(struct virtblk_req *vbr)
> +{
> + switch (vbr->status) {
> + case VIRTIO_BLK_S_OK:
> + return 0;
> + case VIRTIO_BLK_S_UNSUPP:
> + return -ENOTTY;
> + default:
> + return -EIO;
> + }
> +}
> +
> +static inline void virtblk_request_done(struct virtio_blk *vblk,
> + struct virtblk_req *vbr)
> +{
> + struct request *req = vbr->req;
> + int error = virtblk_result(vbr);
> +
> + if (req->cmd_type == REQ_TYPE_BLOCK_PC) {
> + req->resid_len = vbr->in_hdr.residual;
> + req->sense_len = vbr->in_hdr.sense_len;
> + req->errors = vbr->in_hdr.errors;
> + } else if (req->cmd_type == REQ_TYPE_SPECIAL) {
> + req->errors = (error != 0);
> + }
> +
> + __blk_end_request_all(req, error);
> + mempool_free(vbr, vblk->pool);
> +}
> +
> +static inline void virtblk_bio_done(struct virtio_blk *vblk,
> + struct virtblk_req *vbr)
> +{
> + bio_endio(vbr->bio, virtblk_result(vbr));
> + mempool_free(vbr, vblk->pool);
> +}
> +
> +static void virtblk_done(struct virtqueue *vq)
> {
> struct virtio_blk *vblk = vq->vdev->priv;
> + unsigned long bio_done = 0, req_done = 0;
> struct virtblk_req *vbr;
> - unsigned int len;
> unsigned long flags;
> + unsigned int len;
>
> spin_lock_irqsave(vblk->disk->queue->queue_lock, flags);
> while ((vbr = virtqueue_get_buf(vblk->vq, &len)) != NULL) {
> - int error;
> -
> - switch (vbr->status) {
> - case VIRTIO_BLK_S_OK:
> - error = 0;
> - break;
> - case VIRTIO_BLK_S_UNSUPP:
> - error = -ENOTTY;
> - break;
> - default:
> - error = -EIO;
> - break;
> - }
> -
> - switch (vbr->req->cmd_type) {
> - case REQ_TYPE_BLOCK_PC:
> - vbr->req->resid_len = vbr->in_hdr.residual;
> - vbr->req->sense_len = vbr->in_hdr.sense_len;
> - vbr->req->errors = vbr->in_hdr.errors;
> - break;
> - case REQ_TYPE_SPECIAL:
> - vbr->req->errors = (error != 0);
> - break;
> - default:
> - break;
> + if (vbr->bio) {
> + virtblk_bio_done(vblk, vbr);
> + bio_done++;
> + } else {
> + virtblk_request_done(vblk, vbr);
> + req_done++;
> }
> -
> - __blk_end_request_all(vbr->req, error);
> - mempool_free(vbr, vblk->pool);
> }
> /* In case queue is stopped waiting for more buffers. */
> - blk_start_queue(vblk->disk->queue);
> + if (req_done)
> + blk_start_queue(vblk->disk->queue);
> spin_unlock_irqrestore(vblk->disk->queue->queue_lock, flags);
> +
> + if (bio_done)
> + wake_up(&vblk->queue_wait);
> +}
> +
> +static inline struct virtblk_req *virtblk_alloc_req(struct virtio_blk *vblk,
> + gfp_t gfp_mask)
> +{
> + struct virtblk_req *vbr;
> +
> + vbr = mempool_alloc(vblk->pool, gfp_mask);
> + if (vbr && use_bio)
> + sg_init_table(vbr->sg, vblk->sg_elems);
> +
> + return vbr;
> }
>
> static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
> @@ -106,13 +144,13 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
> unsigned long num, out = 0, in = 0;
> struct virtblk_req *vbr;
>
> - vbr = mempool_alloc(vblk->pool, GFP_ATOMIC);
> + vbr = virtblk_alloc_req(vblk, GFP_ATOMIC);
> if (!vbr)
> /* When another request finishes we'll try again. */
> return false;
>
> vbr->req = req;
> -
> + vbr->bio = NULL;
> if (req->cmd_flags & REQ_FLUSH) {
> vbr->out_hdr.type = VIRTIO_BLK_T_FLUSH;
> vbr->out_hdr.sector = 0;
> @@ -172,7 +210,8 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
> }
> }
>
> - if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC)<0) {
> + if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr,
> + GFP_ATOMIC) < 0) {
> mempool_free(vbr, vblk->pool);
> return false;
> }
> @@ -180,7 +219,7 @@ static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
> return true;
> }
>
> -static void do_virtblk_request(struct request_queue *q)
> +static void virtblk_request(struct request_queue *q)
> {
> struct virtio_blk *vblk = q->queuedata;
> struct request *req;
> @@ -203,6 +242,82 @@ static void do_virtblk_request(struct request_queue *q)
> virtqueue_kick(vblk->vq);
> }
>
> +static void virtblk_add_buf_wait(struct virtio_blk *vblk,
> + struct virtblk_req *vbr,
> + unsigned long out,
> + unsigned long in)
> +{
> + DEFINE_WAIT(wait);
> +
> + for (;;) {
> + prepare_to_wait_exclusive(&vblk->queue_wait, &wait,
> + TASK_UNINTERRUPTIBLE);
> +
> + spin_lock_irq(vblk->disk->queue->queue_lock);
> + if (virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
> + GFP_ATOMIC) < 0) {
> + spin_unlock_irq(vblk->disk->queue->queue_lock);
> + io_schedule();
> + } else {
> + virtqueue_kick(vblk->vq);
> + spin_unlock_irq(vblk->disk->queue->queue_lock);
> + break;
> + }
> +
> + }
> +
> + finish_wait(&vblk->queue_wait, &wait);
> +}
> +
> +static void virtblk_make_request(struct request_queue *q, struct bio *bio)
> +{
> + struct virtio_blk *vblk = q->queuedata;
> + unsigned int num, out = 0, in = 0;
> + struct virtblk_req *vbr;
> +
> + BUG_ON(bio->bi_phys_segments + 2 > vblk->sg_elems);
> + BUG_ON(bio->bi_rw & (REQ_FLUSH | REQ_FUA));
> +
> + vbr = virtblk_alloc_req(vblk, GFP_NOIO);
> + if (!vbr) {
> + bio_endio(bio, -ENOMEM);
> + return;
> + }
> +
> + vbr->bio = bio;
> + vbr->req = NULL;
> + vbr->out_hdr.type = 0;
> + vbr->out_hdr.sector = bio->bi_sector;
> + vbr->out_hdr.ioprio = bio_prio(bio);
> +
> + sg_set_buf(&vbr->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
> +
> + num = blk_bio_map_sg(q, bio, vbr->sg + out);
> +
> + sg_set_buf(&vbr->sg[num + out + in++], &vbr->status,
> + sizeof(vbr->status));
> +
> + if (num) {
> + if (bio->bi_rw & REQ_WRITE) {
> + vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
> + out += num;
> + } else {
> + vbr->out_hdr.type |= VIRTIO_BLK_T_IN;
> + in += num;
> + }
> + }
> +
> + spin_lock_irq(vblk->disk->queue->queue_lock);
> + if (unlikely(virtqueue_add_buf(vblk->vq, vbr->sg, out, in, vbr,
> + GFP_ATOMIC) < 0)) {
> + spin_unlock_irq(vblk->disk->queue->queue_lock);
> + virtblk_add_buf_wait(vblk, vbr, out, in);
> + return;
> + }
> + virtqueue_kick(vblk->vq);
> + spin_unlock_irq(vblk->disk->queue->queue_lock);
> +}
> +
> /* return id (s/n) string for *disk to *id_str
> */
> static int virtblk_get_id(struct gendisk *disk, char *id_str)
> @@ -360,7 +475,7 @@ static int init_vq(struct virtio_blk *vblk)
> int err = 0;
>
> /* We expect one virtqueue, for output. */
> - vblk->vq = virtio_find_single_vq(vblk->vdev, blk_done, "requests");
> + vblk->vq = virtio_find_single_vq(vblk->vdev, virtblk_done, "requests");
> if (IS_ERR(vblk->vq))
> err = PTR_ERR(vblk->vq);
>
> @@ -414,7 +529,7 @@ static void virtblk_update_cache_mode(struct virtio_device *vdev)
> u8 writeback = virtblk_get_cache_mode(vdev);
> struct virtio_blk *vblk = vdev->priv;
>
> - if (writeback)
> + if (writeback && !use_bio)
> blk_queue_flush(vblk->disk->queue, REQ_FLUSH);
> else
> blk_queue_flush(vblk->disk->queue, 0);
> @@ -477,6 +592,8 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> struct virtio_blk *vblk;
> struct request_queue *q;
> int err, index;
> + int pool_size;
> +
> u64 cap;
> u32 v, blk_size, sg_elems, opt_io_size;
> u16 min_io_size;
> @@ -506,10 +623,12 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> goto out_free_index;
> }
>
> + init_waitqueue_head(&vblk->queue_wait);
> vblk->vdev = vdev;
> vblk->sg_elems = sg_elems;
> sg_init_table(vblk->sg, vblk->sg_elems);
> mutex_init(&vblk->config_lock);
> +
> INIT_WORK(&vblk->config_work, virtblk_config_changed_work);
> vblk->config_enable = true;
>
> @@ -517,7 +636,10 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> if (err)
> goto out_free_vblk;
>
> - vblk->pool = mempool_create_kmalloc_pool(1,sizeof(struct virtblk_req));
> + pool_size = sizeof(struct virtblk_req);
> + if (use_bio)
> + pool_size += sizeof(struct scatterlist) * sg_elems;
> + vblk->pool = mempool_create_kmalloc_pool(1, pool_size);
> if (!vblk->pool) {
> err = -ENOMEM;
> goto out_free_vq;
> @@ -530,12 +652,14 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> goto out_mempool;
> }
>
> - q = vblk->disk->queue = blk_init_queue(do_virtblk_request, NULL);
> + q = vblk->disk->queue = blk_init_queue(virtblk_request, NULL);
> if (!q) {
> err = -ENOMEM;
> goto out_put_disk;
> }
>
> + if (use_bio)
> + blk_queue_make_request(q, virtblk_make_request);
> q->queuedata = vblk;
>
> virtblk_name_format("vd", index, vblk->disk->disk_name, DISK_NAME_LEN);
> @@ -620,7 +744,6 @@ static int __devinit virtblk_probe(struct virtio_device *vdev)
> if (!err && opt_io_size)
> blk_queue_io_opt(q, blk_size * opt_io_size);
>
> -
> add_disk(vblk->disk);
> err = device_create_file(disk_to_dev(vblk->disk), &dev_attr_serial);
> if (err)
> --
> 1.7.10.4
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/