Re: [PATCH v5] nvme: multipath: Implemented new iopolicy "queue-depth"

From: Hannes Reinecke
Date: Thu May 23 2024 - 02:28:42 EST


On 5/23/24 06:29, Chaitanya Kulkarni wrote:
On 5/22/24 09:54, John Meneghini wrote:
From: "Ewan D. Milne" <emilne@xxxxxxxxxx>

The round-robin path selector is inefficient in cases where there is a
difference in latency between paths. In the presence of one or more
high latency paths the round-robin selector continues to use the high
latency path equally. This results in a bias towards the highest latency
path and can cause a significant decrease in overall performance as IOs
pile on the highest latency path. This problem is acute with NVMe-oF
controllers.

The queue-depth policy instead sends I/O requests down the path with the
least amount of requests in its request queue. Paths with lower latency
will clear requests more quickly and have less requests in their queues
compared to higher latency paths. The goal of this path selector is to
make more use of lower latency paths which will bring down overall IO
latency and increase throughput and performance.

Signed-off-by: Thomas Song <tsong@xxxxxxxxxxxxxxx>
[emilne: patch developed by Thomas Song @ Pure Storage, fixed whitespace
and compilation warnings, updated MODULE_PARM description, and
fixed potential issue with ->current_path[] being used]
Signed-off-by: Ewan D. Milne <emilne@xxxxxxxxxx>
[jmeneghi: vairious changes and improvements, addressed review comments]
Signed-off-by: John Meneghini <jmeneghi@xxxxxxxxxx>
Link: https://lore.kernel.org/linux-nvme/20240509202929.831680-1-jmeneghi@xxxxxxxxxx/
Tested-by: Marco Patalano <mpatalan@xxxxxxxxxx>
Reviewed-by: Randy Jennings <randyj@xxxxxxxxxxxxxxx>
Tested-by: Jyoti Rani <jrani@xxxxxxxxxxxxxxx>
---

[...]

+static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
+{
+ struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+ unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+ unsigned int depth;
+
+ list_for_each_entry_rcu(ns, &head->list, siblings) {
+ if (nvme_path_is_disabled(ns))
+ continue;
+
+ depth = atomic_read(&ns->ctrl->nr_active);
+
+ switch (ns->ana_state) {
+ case NVME_ANA_OPTIMIZED:
+ if (depth < min_depth_opt) {
+ min_depth_opt = depth;
+ best_opt = ns;
+ }
+ break;
+

nit:- no need to add white line needed after break above ?

+ case NVME_ANA_NONOPTIMIZED:
+ if (depth < min_depth_nonopt) {
+ min_depth_nonopt = depth;
+ best_nonopt = ns;
+ }
+ break;
+ default:
+ break;
+ }
+
+ if (min_depth_opt == 0)
+ return best_opt;
+ }
+
+ return best_opt ? best_opt : best_nonopt;
+}
+

[...]

@@ -800,6 +860,29 @@ static ssize_t nvme_subsys_iopolicy_show(struct device *dev,
nvme_iopolicy_names[READ_ONCE(subsys->iopolicy)]);
}
+static void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)

nit:- overly long line, as rest of the file is < 80 char par line ?

+{
+ struct nvme_ctrl *ctrl;
+ int old_iopolicy = READ_ONCE(subsys->iopolicy);
+
+ if (old_iopolicy == iopolicy)
+ return;
+
+ WRITE_ONCE(subsys->iopolicy, iopolicy);
+
+ /* iopolicy changes reset the counters and clear the mpath by design */
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ atomic_set(&ctrl->nr_active, 0);
+ nvme_mpath_clear_ctrl_paths(ctrl);
+ }
+ mutex_unlock(&nvme_subsystems_lock);
+
+ pr_notice("%s: changed from %s to %s for subsysnqn %s\n", __func__,
+ nvme_iopolicy_names[old_iopolicy], nvme_iopolicy_names[iopolicy],

nit: overly long line above as rest of the file is < 80 char ...

As far as I remember, most of the nvme code uses pr_info(), but if
decision has been made to use pr_notice() for a specific reason then
please ignore this comment.

+ subsys->subnqn);
+}
+

[...]

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index fc31bd340a63..fa3833d88a85 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -50,6 +50,8 @@ extern struct workqueue_struct *nvme_wq;
extern struct workqueue_struct *nvme_reset_wq;
extern struct workqueue_struct *nvme_delete_wq;
+extern struct mutex nvme_subsystems_lock;
+
/*
* List of workarounds for devices that required behavior not specified in
* the standard.
@@ -195,6 +197,7 @@ enum {
NVME_REQ_CANCELLED = (1 << 0),
NVME_REQ_USERCMD = (1 << 1),
NVME_MPATH_IO_STATS = (1 << 2),
+ NVME_MPATH_CNT_ACTIVE = (1 << 3),

nit:- I think we need to align above line to rest of the members in
      this enum ...

};
static inline struct nvme_request *nvme_req(struct request *req)
@@ -359,6 +362,7 @@ struct nvme_ctrl {
size_t ana_log_size;
struct timer_list anatt_timer;
struct work_struct ana_work;
+ atomic_t nr_active;
#endif
#ifdef CONFIG_NVME_HOST_AUTH
@@ -407,6 +411,7 @@ static inline enum nvme_ctrl_state nvme_ctrl_state(struct nvme_ctrl *ctrl)
enum nvme_iopolicy {
NVME_IOPOLICY_NUMA,
NVME_IOPOLICY_RR,
+ NVME_IOPOLICY_QD,
};
struct nvme_subsystem {

apart from the few nits patch does looks good to me.

Reviewed-by: Chaitanya Kulkarni <kch@xxxxxxxxxx>

not a blocker to merge this patch, but we need a blktests for this code
in nvme
category ...

As presented at LSF by Daniel; ALUA support (and, with that, multipath support) is one of the topics to be implemented for blktests.
And without that we can't have a meaningful QD test.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich