Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

From: Sagi Grimberg
Date: Tue May 21 2024 - 05:50:10 EST

On 21/05/2024 11:48, Nilay Shroff wrote:

On 5/21/24 01:50, John Meneghini wrote:
From: "Ewan D. Milne" <emilne@xxxxxxxxxx>

The round-robin path selector is inefficient in cases where there is a
difference in latency between multiple active optimized paths. In the
presence of one or more high latency paths the round-robin selector
continues to use the high latency path equally. This results in a bias
towards the highest latency path and can cause a significant decrease
in overall performance as IOs pile on the highest latency path. This
problem is particularly acute with NVMe-oF controllers.

The queue-depth policy instead sends I/O requests down the path with the
fewest requests in its request queue. Paths with lower latency will clear
requests more quickly and have fewer requests queued compared to higher
latency paths. The goal of this path selector is to make better use of
lower latency paths, which will bring down overall I/O latency.

Signed-off-by: Ewan D. Milne <emilne@xxxxxxxxxx>
[tsong: patch developed by Thomas Song @ Pure Storage, fixed whitespace
and compilation warnings, updated MODULE_PARM description, and
fixed potential issue with ->current_path[] being used]
Signed-off-by: Thomas Song <tsong@xxxxxxxxxxxxxxx>
[jmeneghi: various changes and improvements, addressed review comments]
Signed-off-by: John Meneghini <jmeneghi@xxxxxxxxxx>
Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@xxxxxxxxxx/
Tested-by: Marco Patalano <mpatalan@xxxxxxxxxx>
Reviewed-by: Randy Jennings <randyj@xxxxxxxxxx>
Tested-by: Jyoti Rani <jani@xxxxxxxxxxxxxxx>
---
drivers/nvme/host/core.c | 2 +-
drivers/nvme/host/multipath.c | 86 +++++++++++++++++++++++++++++++++--
drivers/nvme/host/nvme.h | 9 ++++
3 files changed, 92 insertions(+), 5 deletions(-)

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index a066429b790d..1dd7c52293ff 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -110,7 +110,7 @@ struct workqueue_struct *nvme_delete_wq;
EXPORT_SYMBOL_GPL(nvme_delete_wq);
static LIST_HEAD(nvme_subsystems);
-static DEFINE_MUTEX(nvme_subsystems_lock);
+DEFINE_MUTEX(nvme_subsystems_lock);
static DEFINE_IDA(nvme_instance_ida);
static dev_t nvme_ctrl_base_chr_devt;
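
(Presumably the mutex loses its static qualifier so that multipath.c can
walk the subsystem list when the module-wide iopolicy parameter changes;
that hunk isn't quoted above.) A minimal sketch of what such an update
helper might look like, assuming the list and lock names above; the
helper name nvme_subsys_iopolicy_update is illustrative, not necessarily
the patch's exact code:

/*
 * Sketch only: propagate a new module-wide iopolicy to every
 * subsystem under nvme_subsystems_lock.
 */
static void nvme_subsys_iopolicy_update(int iopolicy)
{
	struct nvme_subsystem *subsys;

	mutex_lock(&nvme_subsystems_lock);
	list_for_each_entry(subsys, &nvme_subsystems, entry)
		WRITE_ONCE(subsys->iopolicy, iopolicy);
	mutex_unlock(&nvme_subsystems_lock);
}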
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 5397fb428b24..0e2b6e720e95 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,7 @@ MODULE_PARM_DESC(multipath,
static const char *nvme_iopolicy_names[] = {
[NVME_IOPOLICY_NUMA] = "numa",
[NVME_IOPOLICY_RR] = "round-robin",
+ [NVME_IOPOLICY_QD] = "queue-depth",
};
static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +30,8 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
iopolicy = NVME_IOPOLICY_NUMA;
else if (!strncmp(val, "round-robin", 11))
iopolicy = NVME_IOPOLICY_RR;
+ else if (!strncmp(val, "queue-depth", 11))
+ iopolicy = NVME_IOPOLICY_QD;
else
return -EINVAL;
@@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
&iopolicy, 0644);
MODULE_PARM_DESC(iopolicy,
- "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
+ "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");
void nvme_mpath_default_iopolicy(struct nvme_subsystem *subsys)
{
@@ -127,6 +130,11 @@ void nvme_mpath_start_request(struct request *rq)
struct nvme_ns *ns = rq->q->queuedata;
struct gendisk *disk = ns->head->disk;
+ if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
+ atomic_inc(&ns->ctrl->nr_active);
+ nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
+ }
+
if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
return;
@@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
{
struct nvme_ns *ns = rq->q->queuedata;
+ if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
+ atomic_dec_if_positive(&ns->ctrl->nr_active);
+
if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
return;
+
bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
blk_rq_bytes(rq) >> SECTOR_SHIFT,
nvme_req(rq)->start_time);
@@ -330,6 +342,40 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
return found;
}
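
The selector itself is in the ~40 lines this hunk adds, which aren't
quoted above. Conceptually it picks the ANA-optimized path with the
fewest in-flight requests. A minimal sketch, assuming the nr_active
counter from the earlier hunks; the names and the omitted
non-optimized fallback are illustrative, not the patch's exact code:

/*
 * Sketch only: choose the optimized path with the smallest number
 * of in-flight multipath I/Os.
 */
static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
{
	struct nvme_ns *ns, *best = NULL;
	unsigned int min_depth = UINT_MAX;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		unsigned int depth;

		if (nvme_path_is_disabled(ns) ||
		    ns->ana_state != NVME_ANA_OPTIMIZED)
			continue;

		/* fewest in-flight multipath I/Os wins */
		depth = atomic_read(&ns->ctrl->nr_active);
		if (depth < min_depth) {
			best = ns;
			min_depth = depth;
		}
	}
	return best;
}

Once applied, the policy can be selected per subsystem through the
existing iopolicy sysfs attribute, e.g.
echo queue-depth > /sys/class/nvme-subsystem/nvme-subsys0/iopolicy,
or module-wide via the nvme_core.iopolicy parameter.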
I think you may also want to decrement the nr_active counter in case an
in-flight nvme request is cancelled. If the request is cancelled then
nvme_mpath_end_request() wouldn't be invoked, so you may want to decrement
the nr_active counter from nvme_cancel_request() as below:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index bf7615cb36ee..4fea7883ce8e 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -497,8 +497,9 @@ EXPORT_SYMBOL_GPL(nvme_host_path_error);
bool nvme_cancel_request(struct request *req, void *data)
{
- dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
- "Cancelling I/O %d", req->tag);
+ struct nvme_ctrl *ctrl = (struct nvme_ctrl *)data;
+
+ dev_dbg_ratelimited(ctrl->device, "Cancelling I/O %d", req->tag);
/* don't abort one completed or idle request */
if (blk_mq_rq_state(req) != MQ_RQ_IN_FLIGHT)
@@ -506,6 +507,8 @@ bool nvme_cancel_request(struct request *req, void *data)
nvme_req(req)->status = NVME_SC_HOST_ABORTED_CMD;
nvme_req(req)->flags |= NVME_REQ_CANCELLED;
+ if ((nvme_req(req)->flags & NVME_MPATH_CNT_ACTIVE))
+ atomic_dec(&ctrl->nr_active);

I don't think this matters because cancellation only happens when we
tear down the controller anyway...

btw, can we have a better name than nr_active? this is just for IO and only
for multipath. Maybe ctrl->nr_mpath_io_active?

Also perhaps rename the flag to NVME_MPATH_CTRL_IO_ACCOUNTING?