[PATCH] iosched: Add i10 I/O Scheduler
From: Rachit Agarwal
Date: Thu Nov 12 2020 - 09:08:13 EST
From: Rachit Agarwal <rach4x0r@xxxxxxxxx>>
Hi All,
I/O batching is beneficial for optimizing IOPS and throughput for various
applications. For instance, several kernel block drivers would benefit from batching,
including mmc [1] and tcp-based storage drivers like nvme-tcp [2,3]. While we have
support for batching dispatch [4], we need an I/O scheduler to efficiently enable
batching. Such a scheduler is particularly interesting for disaggregated storage,
where the access latency of remote disaggregated storage may be higher than local
storage access; thus, batching can significantly help in amortizing the remote access
latency while increasing the throughput.
This patch introduces the i10 I/O scheduler, which performs batching per hctx in terms
of #requests, #bytes, and timeouts (at microseconds granularity). i10 starts
dispatching only when #requests or #bytes is larger than a default threshold or when
a timer expires. After that, batching dispatch [3] would happen, allowing batching
at device drivers along with "bd->last" and ".commit_rqs".
The i10 I/O scheduler builds upon recent work on [6]. We have tested the i10 I/O
scheduler with nvme-tcp optimizaitons [2,3] and batching dispatch [4], varying number
of cores, varying read/write ratios, and varying request sizes, and with NVMe SSD and
RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
terms of IOPS per core over "noop" I/O scheduler. These results are available at [5],
and many additional results are presented in [6].
While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
in the i10 I/O scheduler is throughput maximization. Hence there is no latency target
nor a need for a global tracking context, so a new scheduler is needed rather than
to build this functionality to an existing scheduler.
We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
tests in [6]. For our future work, we plan to support adaptive batching according to
system load and to extend the scheduler to support isolation in multi-tenant deployments
(to simultaneously achieve low tail latency for latency-sensitive applications and high
throughput for throughput-bound applications).
References
[1] https://lore.kernel.org/linux-block/cover.1587888520.git.baolin.wang7@xxxxxxxxx/T/#mc48a8fb6069843827458f5fea722e1179d32af2a
[2] https://git.infradead.org/nvme.git/commit/122e5b9f3d370ae11e1502d14ff5c7ea9b144a76
[3] https://git.infradead.org/nvme.git/commit/86f0348ace1510d7ac25124b096fb88a6ab45270
[4] https://lore.kernel.org/linux-block/20200630102501.2238972-1-ming.lei@xxxxxxxxxx/
[5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
[6] https://www.usenix.org/conference/nsdi20/presentation/hwang
Signed-off-by: Jaehyun Hwang <jaehyun.hwang@xxxxxxxxxxx>
Signed-off-by: Qizhe Cai <qc228@xxxxxxxxxxx>
Signed-off-by: Midhul Vuppalapati <mvv25@xxxxxxxxxxx>
Signed-off-by: Rachit Agarwal <ragarwal@xxxxxxxxxxx>
Signed-off-by: Sagi Grimberg <sagi@xxxxxxxxxxxxxxxxx>
---
Documentation/block/i10-iosched.rst | 69 +++++
block/Kconfig.iosched | 8 +
block/Makefile | 1 +
block/i10-iosched.c | 421 ++++++++++++++++++++++++++++
4 files changed, 499 insertions(+)
create mode 100644 Documentation/block/i10-iosched.rst
create mode 100644 block/i10-iosched.c
diff --git a/Documentation/block/i10-iosched.rst b/Documentation/block/i10-iosched.rst
new file mode 100644
index 000000000000..3db7ca3ed4c1
--- /dev/null
+++ b/Documentation/block/i10-iosched.rst
@@ -0,0 +1,69 @@
+==========================
+i10 I/O scheduler overview
+==========================
+
+I/O batching is beneficial for optimizing IOPS and throughput for various
+applications. For instance, several kernel block drivers would
+benefit from batching, including mmc [1] and tcp-based storage drivers like
+nvme-tcp [2,3]. While we have support for batching dispatch [4], we need an
+I/O scheduler to efficiently enable batching. Such a scheduler is particularly
+interesting for disaggregated storage, where the access latency of remote
+disaggregated storage may be higher than local storage access; thus, batching
+can significantly help in amortizing the remote access latency while increasing
+the throughput.
+
+This patch introduces the i10 I/O scheduler, which performs batching per hctx in terms
+of #requests, #bytes, and timeouts (at microseconds granularity). i10 starts
+dispatching only when #requests or #bytes is larger than a default threshold or when
+a timer expires. After that, batching dispatch [3] would happen, allowing batching
+at device drivers along with "bd->last" and ".commit_rqs".
+
+The i10 I/O scheduler builds upon recent work on [6]. We have tested the i10 I/O
+scheduler with nvme-tcp optimizaitons [2,3] and batching dispatch [4], varying number
+of cores, varying read/write ratios, and varying request sizes, and with NVMe SSD and
+RAM block device. For NVMe SSDs, the i10 I/O scheduler achieves ~60% improvements in
+terms of IOPS per core over "noop" I/O scheduler. These results are available at [5],
+and many additional results are presented in [6].
+
+While other schedulers may also batch I/O (e.g., mq-deadline), the optimization target
+in the i10 I/O scheduler is throughput maximization. Hence there is no latency target
+nor a need for a global tracking context, so a new scheduler is needed rather than
+to build this functionality to an existing scheduler.
+
+We currently use fixed default values as batching thresholds (e.g., 16 for #requests,
+64KB for #bytes, and 50us for timeout). These default values are based on sensitivity
+tests in [6]. For our future work, we plan to support adaptive batching according to
+system load and to extend the scheduler to support isolation in multi-tenant deployments
+(to simultaneously achieve low tail latency for latency-sensitive applications and high
+throughput for throughput-bound applications).
+
+References
+[1] https://lore.kernel.org/linux-block/cover.1587888520.git.baolin.wang7@xxxxxxxxx/T/#mc48a8fb6069843827458f5fea722e1179d32af2a
+[2] https://git.infradead.org/nvme.git/commit/122e5b9f3d370ae11e1502d14ff5c7ea9b144a76
+[3] https://git.infradead.org/nvme.git/commit/86f0348ace1510d7ac25124b096fb88a6ab45270
+[4] https://lore.kernel.org/linux-block/20200630102501.2238972-1-ming.lei@xxxxxxxxxx/
+[5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
+[6] https://www.usenix.org/conference/nsdi20/presentation/hwang
+
+==========================
+i10 I/O scheduler tunables
+==========================
+
+The three tunables for the i10 scheduler are the number of requests for
+reads/writes, the number of bytes for writes, and a timeout value.
+i10 will use these values for batching requests.
+
+batch_nr
+--------
+Number of requests for batching read/write requests
+Default: 16
+
+batch_bytes
+-----------
+Number of bytes for batching write requests
+Default: 65536 (bytes)
+
+batch_timeout
+-------------
+Timeout value for batching (in microseconds)
+Default: 50 (us)
diff --git a/block/Kconfig.iosched b/block/Kconfig.iosched
index 2f2158e05a91..5b3623b19487 100644
--- a/block/Kconfig.iosched
+++ b/block/Kconfig.iosched
@@ -44,6 +44,14 @@ config BFQ_CGROUP_DEBUG
Enable some debugging help. Currently it exports additional stat
files in a cgroup which can be useful for debugging.
+config MQ_IOSCHED_I10
+ tristate "i10 I/O scheduler"
+ default y
+ help
+ The i10 I/O Scheduler supports batching at BLK-MQ.
+ Any device driver that benefits from batching
+ (e.g., NVMe-over-TCP) can use this scheduler.
+
endmenu
endif
diff --git a/block/Makefile b/block/Makefile
index 8d841f5f986f..27e0789589ea 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -21,6 +21,7 @@ obj-$(CONFIG_BLK_CGROUP_IOLATENCY) += blk-iolatency.o
obj-$(CONFIG_BLK_CGROUP_IOCOST) += blk-iocost.o
obj-$(CONFIG_MQ_IOSCHED_DEADLINE) += mq-deadline.o
obj-$(CONFIG_MQ_IOSCHED_KYBER) += kyber-iosched.o
+obj-$(CONFIG_MQ_IOSCHED_I10) += i10-iosched.o
bfq-y := bfq-iosched.o bfq-wf2q.o bfq-cgroup.o
obj-$(CONFIG_IOSCHED_BFQ) += bfq.o
diff --git a/block/i10-iosched.c b/block/i10-iosched.c
new file mode 100644
index 000000000000..b5451beaa66d
--- /dev/null
+++ b/block/i10-iosched.c
@@ -0,0 +1,421 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * The i10 I/O Scheduler - supports batching at blk-mq.
+ * The main use case is disaggregated storage access
+ * using NVMe-over-Fabric (e.g., NVMe-over-TCP device driver).
+ *
+ * An early version of the idea is described and evaluated in
+ * "TCP ≈ RDMA: CPU-efficient Remote Storage Access with i10",
+ * USENIX NSDI 2020.
+ *
+ * Copyright (C) 2020 Cornell University
+ * Jaehyun Hwang <jaehyun.hwang@xxxxxxxxxxx>
+ * Qizhe Cai <qc228@xxxxxxxxxxx>
+ * Midhul Vuppalapati <mvv25@xxxxxxxxxxx%>
+ * Rachit Agarwal <ragarwal@xxxxxxxxxxx>
+ */
+
+#include <linux/kernel.h>
+#include <linux/blkdev.h>
+#include <linux/blk-mq.h>
+#include <linux/elevator.h>
+#include <linux/module.h>
+#include <linux/sbitmap.h>
+
+#include "blk.h"
+#include "blk-mq.h"
+#include "blk-mq-debugfs.h"
+#include "blk-mq-sched.h"
+#include "blk-mq-tag.h"
+
+/* Default batch size in number of requests */
+#define I10_DEF_BATCH_NR 16
+/* Default batch size in bytes (for write requests) */
+#define I10_DEF_BATCH_BYTES 65536
+/* Default timeout value for batching (us units) */
+#define I10_DEF_BATCH_TIMEOUT 50
+
+enum i10_state {
+ /* Batching state:
+ * Do not run dispatching until we have
+ * a certain amount of requests or a timer expires.
+ */
+ I10_STATE_BATCH,
+
+ /* Dispatching state:
+ * Run dispatching until all requests in the
+ * scheduler's hctx ihq are dispatched.
+ */
+ I10_STATE_DISPATCH,
+};
+
+struct i10_queue_data {
+ struct request_queue *q;
+
+ unsigned int def_batch_nr;
+ unsigned int def_batch_bytes;
+ unsigned int def_batch_timeout;
+};
+
+struct i10_hctx_queue {
+ spinlock_t lock;
+ struct list_head rq_list;
+
+ struct blk_mq_hw_ctx *hctx;
+
+ unsigned int batch_nr;
+ unsigned int batch_bytes;
+ unsigned int batch_timeout;
+
+ unsigned int qlen_nr;
+ unsigned int qlen_bytes;
+
+ struct hrtimer dispatch_timer;
+ enum i10_state state;
+};
+
+static struct i10_queue_data *i10_queue_data_alloc(struct request_queue *q)
+{
+ struct i10_queue_data *iqd;
+
+ iqd = kzalloc_node(sizeof(*iqd), GFP_KERNEL, q->node);
+ if (!iqd)
+ return ERR_PTR(-ENOMEM);
+
+ iqd->q = q;
+ iqd->def_batch_nr = I10_DEF_BATCH_NR;
+ iqd->def_batch_bytes = I10_DEF_BATCH_BYTES;
+ iqd->def_batch_timeout = I10_DEF_BATCH_TIMEOUT;
+
+ return iqd;
+}
+
+static int i10_init_sched(struct request_queue *q, struct elevator_type *e)
+{
+ struct i10_queue_data *iqd;
+ struct elevator_queue *eq;
+
+ eq = elevator_alloc(q, e);
+ if (!eq)
+ return -ENOMEM;
+
+ iqd = i10_queue_data_alloc(q);
+ if (IS_ERR(iqd)) {
+ kobject_put(&eq->kobj);
+ return PTR_ERR(iqd);
+ }
+
+ blk_stat_enable_accounting(q);
+
+ eq->elevator_data = iqd;
+ q->elevator = eq;
+
+ return 0;
+}
+
+static void i10_exit_sched(struct elevator_queue *e)
+{
+ struct i10_queue_data *iqd = e->elevator_data;
+
+ kfree(iqd);
+}
+
+enum hrtimer_restart i10_hctx_timeout_handler(struct hrtimer *timer)
+{
+ struct i10_hctx_queue *ihq =
+ container_of(timer, struct i10_hctx_queue,
+ dispatch_timer);
+
+ ihq->state = I10_STATE_DISPATCH;
+ blk_mq_run_hw_queue(ihq->hctx, true);
+
+ return HRTIMER_NORESTART;
+}
+
+static void i10_hctx_queue_reset(struct i10_hctx_queue *ihq)
+{
+ ihq->qlen_nr = 0;
+ ihq->qlen_bytes = 0;
+ ihq->state = I10_STATE_BATCH;
+}
+
+static int i10_init_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+ struct i10_hctx_queue *ihq;
+
+ ihq = kzalloc_node(sizeof(*ihq), GFP_KERNEL, hctx->numa_node);
+ if (!ihq)
+ return -ENOMEM;
+
+ spin_lock_init(&ihq->lock);
+ INIT_LIST_HEAD(&ihq->rq_list);
+
+ ihq->hctx = hctx;
+ ihq->batch_nr = 0;
+ ihq->batch_bytes = 0;
+ ihq->batch_timeout = 0;
+
+ hrtimer_init(&ihq->dispatch_timer,
+ CLOCK_MONOTONIC, HRTIMER_MODE_REL);
+ ihq->dispatch_timer.function = &i10_hctx_timeout_handler;
+
+ i10_hctx_queue_reset(ihq);
+
+ hctx->sched_data = ihq;
+
+ return 0;
+}
+
+static void i10_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ hrtimer_cancel(&ihq->dispatch_timer);
+ kfree(hctx->sched_data);
+}
+
+static bool i10_hctx_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio,
+ unsigned int nr_segs)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+ struct list_head *rq_list = &ihq->rq_list;
+ bool merged;
+
+ spin_lock(&ihq->lock);
+ merged = blk_mq_bio_list_merge(hctx->queue, rq_list, bio, nr_segs);
+ spin_unlock(&ihq->lock);
+
+ if (merged && bio_data_dir(bio) == WRITE)
+ ihq->qlen_bytes += bio->bi_iter.bi_size;
+
+ return merged;
+}
+
+/*
+ * The batch size can be adjusted dynamically on a per-hctx basis.
+ * Use per-hctx variables in that case.
+ */
+static inline unsigned int i10_hctx_batch_nr(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return ihq->batch_nr ?
+ ihq->batch_nr : iqd->def_batch_nr;
+}
+
+static inline unsigned int i10_hctx_batch_bytes(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return ihq->batch_bytes ?
+ ihq->batch_bytes : iqd->def_batch_bytes;
+}
+
+static inline unsigned int i10_hctx_batch_timeout(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_queue_data *iqd = hctx->queue->elevator->elevator_data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return ihq->batch_timeout ?
+ ihq->batch_timeout : iqd->def_batch_timeout;
+}
+
+static void i10_hctx_insert_update(struct i10_hctx_queue *ihq,
+ struct request *rq)
+{
+ if (rq_data_dir(rq) == WRITE)
+ ihq->qlen_bytes += blk_rq_bytes(rq);
+ ihq->qlen_nr++;
+}
+
+static void i10_hctx_insert_requests(struct blk_mq_hw_ctx *hctx,
+ struct list_head *rq_list, bool at_head)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+ struct request *rq, *next;
+
+ list_for_each_entry_safe(rq, next, rq_list, queuelist) {
+ struct list_head *head = &ihq->rq_list;
+
+ spin_lock(&ihq->lock);
+ if (at_head)
+ list_move(&rq->queuelist, head);
+ else
+ list_move_tail(&rq->queuelist, head);
+ i10_hctx_insert_update(ihq, rq);
+ blk_mq_sched_request_inserted(rq);
+ spin_unlock(&ihq->lock);
+ }
+
+ /* Start a new timer */
+ if (ihq->state == I10_STATE_BATCH &&
+ !hrtimer_active(&ihq->dispatch_timer))
+ hrtimer_start(&ihq->dispatch_timer,
+ ns_to_ktime(i10_hctx_batch_timeout(hctx)
+ * NSEC_PER_USEC),
+ HRTIMER_MODE_REL);
+}
+
+static struct request *i10_hctx_dispatch_request(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+ struct request *rq;
+
+ spin_lock(&ihq->lock);
+ rq = list_first_entry_or_null(&ihq->rq_list,
+ struct request, queuelist);
+ if (rq)
+ list_del_init(&rq->queuelist);
+ else
+ i10_hctx_queue_reset(ihq);
+ spin_unlock(&ihq->lock);
+
+ return rq;
+}
+
+static inline bool i10_hctx_dispatch_now(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ return (ihq->qlen_nr >= i10_hctx_batch_nr(hctx)) ||
+ (ihq->qlen_bytes >= i10_hctx_batch_bytes(hctx));
+}
+
+/*
+ * Return true if we are in the dispatching state.
+ */
+static bool i10_hctx_has_work(struct blk_mq_hw_ctx *hctx)
+{
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ if (ihq->state == I10_STATE_BATCH) {
+ if (i10_hctx_dispatch_now(hctx)) {
+ ihq->state = I10_STATE_DISPATCH;
+ if (hrtimer_active(&ihq->dispatch_timer))
+ hrtimer_cancel(&ihq->dispatch_timer);
+ }
+ }
+
+ return (ihq->state == I10_STATE_DISPATCH);
+}
+
+#define I10_DEF_BATCH_SHOW_STORE(name) \
+static ssize_t i10_def_batch_##name##_show(struct elevator_queue *e, \
+ char *page) \
+{ \
+ struct i10_queue_data *iqd = e->elevator_data; \
+ \
+ return sprintf(page, "%u\n", iqd->def_batch_##name); \
+} \
+ \
+static ssize_t i10_def_batch_##name##_store(struct elevator_queue *e, \
+ const char *page, size_t count) \
+{ \
+ struct i10_queue_data *iqd = e->elevator_data; \
+ unsigned long long value; \
+ int ret; \
+ \
+ ret = kstrtoull(page, 10, &value); \
+ if (ret) \
+ return ret; \
+ \
+ iqd->def_batch_##name = value; \
+ \
+ return count; \
+}
+I10_DEF_BATCH_SHOW_STORE(nr);
+I10_DEF_BATCH_SHOW_STORE(bytes);
+I10_DEF_BATCH_SHOW_STORE(timeout);
+#undef I10_DEF_BATCH_SHOW_STORE
+
+#define I10_SCHED_ATTR(name) \
+ __ATTR(batch_##name, 0644, i10_def_batch_##name##_show, i10_def_batch_##name##_store)
+static struct elv_fs_entry i10_sched_attrs[] = {
+ I10_SCHED_ATTR(nr),
+ I10_SCHED_ATTR(bytes),
+ I10_SCHED_ATTR(timeout),
+ __ATTR_NULL
+};
+#undef I10_SCHED_ATTR
+
+#ifdef CONFIG_BLK_DEBUG_FS
+#define I10_DEBUGFS_SHOW(name) \
+static int i10_hctx_batch_##name##_show(void *data, struct seq_file *m) \
+{ \
+ struct blk_mq_hw_ctx *hctx = data; \
+ struct i10_hctx_queue *ihq = hctx->sched_data; \
+ \
+ seq_printf(m, "%u\n", ihq->batch_##name); \
+ return 0; \
+} \
+ \
+static int i10_hctx_qlen_##name##_show(void *data, struct seq_file *m) \
+{ \
+ struct blk_mq_hw_ctx *hctx = data; \
+ struct i10_hctx_queue *ihq = hctx->sched_data; \
+ \
+ seq_printf(m, "%u\n", ihq->qlen_##name); \
+ return 0; \
+}
+I10_DEBUGFS_SHOW(nr);
+I10_DEBUGFS_SHOW(bytes);
+#undef I10_DEBUGFS_SHOW
+
+static int i10_hctx_state_show(void *data, struct seq_file *m)
+{
+ struct blk_mq_hw_ctx *hctx = data;
+ struct i10_hctx_queue *ihq = hctx->sched_data;
+
+ seq_printf(m, "%d\n", ihq->state);
+ return 0;
+}
+
+#define I10_HCTX_QUEUE_ATTR(name) \
+ {"batch_" #name, 0400, i10_hctx_batch_##name##_show}, \
+ {"qlen_" #name, 0400, i10_hctx_qlen_##name##_show}
+static const struct blk_mq_debugfs_attr i10_hctx_debugfs_attrs[] = {
+ I10_HCTX_QUEUE_ATTR(nr),
+ I10_HCTX_QUEUE_ATTR(bytes),
+ {"state", 0400, i10_hctx_state_show},
+ {},
+};
+#undef I10_HCTX_QUEUE_ATTR
+#endif
+
+static struct elevator_type i10_sched = {
+ .ops = {
+ .init_sched = i10_init_sched,
+ .exit_sched = i10_exit_sched,
+ .init_hctx = i10_init_hctx,
+ .exit_hctx = i10_exit_hctx,
+ .bio_merge = i10_hctx_bio_merge,
+ .insert_requests = i10_hctx_insert_requests,
+ .dispatch_request = i10_hctx_dispatch_request,
+ .has_work = i10_hctx_has_work,
+ },
+#ifdef CONFIG_BLK_DEBUG_FS
+ .hctx_debugfs_attrs = i10_hctx_debugfs_attrs,
+#endif
+ .elevator_attrs = i10_sched_attrs,
+ .elevator_name = "i10",
+ .elevator_owner = THIS_MODULE,
+};
+
+static int __init i10_init(void)
+{
+ return elv_register(&i10_sched);
+}
+
+static void __exit i10_exit(void)
+{
+ elv_unregister(&i10_sched);
+}
+
+module_init(i10_init);
+module_exit(i10_exit);
+
+MODULE_AUTHOR("Jaehyun Hwang, Qizhe Cai, Midhul Vuppalapati, Rachit Agarwal");
+MODULE_LICENSE("GPLv2");
+MODULE_DESCRIPTION("i10 I/O scheduler");
--
2.22.0