Re: [PATCH v3 1/1] nvme: multipath: Implemented new iopolicy "queue-depth"

From: Keith Busch
Date: Mon May 20 2024 - 16:50:15 EST


On Mon, May 20, 2024 at 04:20:45PM -0400, John Meneghini wrote:
> From: "Ewan D. Milne" <emilne@xxxxxxxxxx>
>
> The round-robin path selector is inefficient in cases where there is a
> difference in latency between multiple active optimized paths. In the
> presence of one or more high latency paths the round-robin selector
> continues to use the high latency path equally. This results in a bias
> towards the highest latency path and can cause a significant decrease
> in overall performance as IOs pile on the highest latency path. This
> problem is particularly acute with NVMe-oF controllers.

The patch looks pretty good to me. Just a few questions/comments.

> static LIST_HEAD(nvme_subsystems);
> -static DEFINE_MUTEX(nvme_subsystems_lock);
> +DEFINE_MUTEX(nvme_subsystems_lock);

This seems odd. Why is this lock protecting both the global
nvme_subsystems list, and also subsystem controllers? IOW, why isn't the
subsys->ctrls list protected by the more fine grained 'subsys->lock'
instead of this global lock?

> @@ -43,7 +46,7 @@ static int nvme_get_iopolicy(char *buf, const struct kernel_param *kp)
> module_param_call(iopolicy, nvme_set_iopolicy, nvme_get_iopolicy,
> &iopolicy, 0644);
> MODULE_PARM_DESC(iopolicy,
> - "Default multipath I/O policy; 'numa' (default) or 'round-robin'");
> + "Default multipath I/O policy; 'numa' (default) , 'round-robin' or 'queue-depth'");

Unnecessary space before the ','.

> +	if (READ_ONCE(ns->head->subsys->iopolicy) == NVME_IOPOLICY_QD) {
> +		atomic_inc(&ns->ctrl->nr_active);
> +		nvme_req(rq)->flags |= NVME_MPATH_CNT_ACTIVE;
> +	}
> +
> 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
> 		return;
>
> @@ -140,8 +148,12 @@ void nvme_mpath_end_request(struct request *rq)
> {
> 	struct nvme_ns *ns = rq->q->queuedata;
>
> +	if ((nvme_req(rq)->flags & NVME_MPATH_CNT_ACTIVE))
> +		atomic_dec_if_positive(&ns->ctrl->nr_active);

You can just do an atomic_dec() since your new flag ties this to the
atomic_inc().
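
To illustrate the pairing, here is a minimal user-space sketch, not the
kernel code. It uses C11 atomics in place of the kernel's atomic_t, and
the model_* names and REQ_CNT_ACTIVE are hypothetical stand-ins for the
structures and flag in the patch. Because the decrement only happens for
requests that were counted at start, the counter cannot go below zero
under this scheme, so a plain decrement should be enough:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures, not the real nvme types. */
#define REQ_CNT_ACTIVE (1u << 0)	/* models NVME_MPATH_CNT_ACTIVE */

struct model_ctrl {
	atomic_int nr_active;		/* models ctrl->nr_active */
};

struct model_req {
	struct model_ctrl *ctrl;
	unsigned int flags;
};

/* Count the request only when the queue-depth policy is selected. */
static void model_start_request(struct model_req *req, bool policy_is_qd)
{
	if (policy_is_qd) {
		atomic_fetch_add(&req->ctrl->nr_active, 1);
		req->flags |= REQ_CNT_ACTIVE;
	}
}

/* Decrement only if this request was counted when it started. */
static void model_end_request(struct model_req *req)
{
	if (req->flags & REQ_CNT_ACTIVE)
		atomic_fetch_sub(&req->ctrl->nr_active, 1);
}
```

This assumes nothing else resets the counter while requests are in
flight; if something does, the flag alone no longer guarantees the
pairing.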

> +static struct nvme_ns *nvme_queue_depth_path(struct nvme_ns_head *head)
> +{
> +	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
> +	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
> +	unsigned int depth;
> +
> +	list_for_each_entry_rcu(ns, &head->list, siblings) {
> +		if (nvme_path_is_disabled(ns))
> +			continue;
> +
> +		depth = atomic_read(&ns->ctrl->nr_active);
> +
> +		switch (ns->ana_state) {
> +		case NVME_ANA_OPTIMIZED:
> +			if (depth < min_depth_opt) {
> +				min_depth_opt = depth;
> +				best_opt = ns;
> +			}
> +			break;
> +
> +		case NVME_ANA_NONOPTIMIZED:
> +			if (depth < min_depth_nonopt) {
> +				min_depth_nonopt = depth;
> +				best_nonopt = ns;
> +			}
> +			break;
> +		default:
> +			break;
> +		}

Could we break out of this loop early if "min_depth_opt == 0"? We can't
find a better path than that, so no need to read the rest of the paths.
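
Roughly what I have in mind, as a user-space sketch over a plain array
rather than the RCU list. The model_* types, the 'visited' counter, and
the final "prefer optimized, else non-optimized" return are assumptions
for illustration; the quoted hunk doesn't show how the function returns:

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical model of the ANA states relevant to path selection. */
enum model_ana_state {
	MODEL_ANA_OPTIMIZED,
	MODEL_ANA_NONOPTIMIZED,
	MODEL_ANA_OTHER,
};

struct model_path {
	enum model_ana_state ana_state;
	unsigned int depth;	/* snapshot of the controller's nr_active */
	int disabled;
};

/*
 * Pick the path with the fewest outstanding requests, preferring
 * optimized paths. '*visited' reports how many entries were examined
 * so the effect of the early exit is observable.
 */
static const struct model_path *
model_queue_depth_path(const struct model_path *paths, size_t n,
		       size_t *visited)
{
	const struct model_path *best_opt = NULL, *best_nonopt = NULL;
	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
	size_t i;

	*visited = 0;
	for (i = 0; i < n; i++) {
		const struct model_path *p = &paths[i];

		(*visited)++;
		if (p->disabled)
			continue;

		switch (p->ana_state) {
		case MODEL_ANA_OPTIMIZED:
			if (p->depth < min_depth_opt) {
				min_depth_opt = p->depth;
				best_opt = p;
			}
			break;
		case MODEL_ANA_NONOPTIMIZED:
			if (p->depth < min_depth_nonopt) {
				min_depth_nonopt = p->depth;
				best_nonopt = p;
			}
			break;
		default:
			break;
		}

		/* An idle optimized path cannot be beaten; stop scanning. */
		if (min_depth_opt == 0)
			break;
	}

	return best_opt ? best_opt : best_nonopt;
}
```

Note the exit only triggers on an idle optimized path. An idle
non-optimized path doesn't end the scan, since a later optimized path
would still be preferred.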

> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy)
> +{
> +	struct nvme_ctrl *ctrl;
> +	int old_iopolicy = READ_ONCE(subsys->iopolicy);
> +

Let's add a check here:

	if (old_iopolicy == iopolicy)
		return;

> @@ -935,6 +940,7 @@ void nvme_mpath_clear_ctrl_paths(struct nvme_ctrl *ctrl);
> void nvme_mpath_shutdown_disk(struct nvme_ns_head *head);
> void nvme_mpath_start_request(struct request *rq);
> void nvme_mpath_end_request(struct request *rq);
> +void nvme_subsys_iopolicy_update(struct nvme_subsystem *subsys, int iopolicy);

This function isn't used outside multipath.c, so it should be static.