Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

From: Jiewei Ke
Date: Thu Apr 10 2025 - 11:45:55 EST


Hi Daniel,

I just noticed that your patchset addresses a similar issue to the one I'm
trying to solve with my recently submitted patchset [1]. Compared to your
approach, mine differs in a few key aspects:

1. Only aborted requests are delayed for retry. In the current implementation,
nvmf_complete_timed_out_request and nvme_cancel_request set the request status
to NVME_SC_HOST_ABORTED_CMD. These requests are usually already sent to the
target, but may have timed out or been canceled before a response is received.
Since the target may still be processing them, the host needs to delay retrying
to ensure the target has completed or cleaned up the stale requests. On the
other hand, requests that are not aborted — such as those that never got
submitted due to no usable path (e.g., from nvme_ns_head_submit_bio), or those
that already received an ANA error from the target — do not need delayed retry.
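
To make the selection concrete, the criterion is roughly the following
(a sketch only; the helper name is mine, see [1] for the actual code):

static bool nvme_req_needs_delayed_retry(struct request *rq)
{
	u16 status = nvme_req(rq)->status & NVME_SC_MASK;

	/*
	 * NVME_SC_HOST_ABORTED_CMD is set by nvme_cancel_request() and
	 * nvmf_complete_timed_out_request(); such a command may still be
	 * live on the target. Everything else keeps the existing
	 * immediate failover/requeue path.
	 */
	return status == NVME_SC_HOST_ABORTED_CMD;
}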

2. The host explicitly disconnects and stops Keep Alive before scheduling the
delayed retry of the aborted requests. This aligns with Section 9.6
"Communication Loss Handling" of the NVMe Base Specification 2.1: once the
host disconnects, the target may take up to the KATO interval to detect the
lost connection and clean up any remaining requests, so retrying earlier than
that can still lead to data corruption.

@@ -2178,6 +2180,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
nvme_quiesce_admin_queue(&ctrl->ctrl);
nvme_disable_ctrl(&ctrl->ctrl, shutdown);
nvme_rdma_teardown_admin_queue(ctrl, shutdown);
+ nvme_delay_kick_retry_lists(&ctrl->ctrl); <<< delay-kick retries after all queues are torn down
}
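
Conceptually, the "delay kick" just defers kicking the retry lists by one
KATO interval, along these lines (simplified sketch; kick_retry_work is a
hypothetical per-ctrl delayed_work, the real code in [1] differs in detail):

static void nvme_delay_kick_retry_lists(struct nvme_ctrl *ctrl)
{
	/* ctrl->kato is in seconds */
	unsigned long delay = ctrl->kato * HZ;

	/*
	 * Give the target one KATO interval to detect the lost connection
	 * and clean up stale commands (NVMe Base Spec 2.1, Section 9.6);
	 * the delayed work then kicks the requeue/retry lists, e.g. via
	 * nvme_kick_requeue_lists(ctrl).
	 */
	queue_delayed_work(nvme_wq, &ctrl->kick_retry_work, delay);
}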

3. Delayed retry of aborted requests is supported across multiple scenarios,
including error recovery work, controller reset, controller deletion, and
controller reconnect failure handling. Here's the relevant code for reference.

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9109d5476417..f07b3960df7c 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2449,6 +2449,7 @@ static int nvme_tcp_setup_ctrl(struct nvme_ctrl *ctrl, bool new)
destroy_admin:
nvme_stop_keep_alive(ctrl);
nvme_tcp_teardown_admin_queue(ctrl, new);
+ nvme_delay_kick_retry_lists(ctrl); <<< requests may have timed out while the ctrl reconnects
return ret;
}

@@ -2494,6 +2495,7 @@ static void nvme_tcp_error_recovery_work(struct work_struct *work)
nvme_tcp_teardown_admin_queue(ctrl, false);
nvme_unquiesce_admin_queue(ctrl);
nvme_auth_stop(ctrl);
+ nvme_delay_kick_retry_lists(ctrl); <<< retry_lists may contain timed-out or cancelled requests

if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_CONNECTING)) {
/* state change failure is ok if we started ctrl delete */

@@ -2513,6 +2515,7 @@ static void nvme_tcp_teardown_ctrl(struct nvme_ctrl *ctrl, bool shutdown)
nvme_quiesce_admin_queue(ctrl);
nvme_disable_ctrl(ctrl, shutdown);
nvme_tcp_teardown_admin_queue(ctrl, shutdown);
+ nvme_delay_kick_retry_lists(ctrl); <<< retry_lists may contain timed-out or cancelled requests during ctrl reset or delete
}

Note that in nvme_tcp_error_recovery_work, the delayed kick must come after
nvme_tcp_teardown_io_queues, because that teardown also cancels requests
that may need to be retried; see the sketch below.
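
To spell out the ordering, the patched error recovery path ends up looking
roughly like this (paraphrased; error handling and unrelated calls omitted):

	/* nvme_tcp_error_recovery_work(), simplified */
	nvme_stop_keep_alive(ctrl);
	nvme_tcp_teardown_io_queues(ctrl, false);	/* cancels outstanding I/O */
	nvme_unquiesce_io_queues(ctrl);
	nvme_tcp_teardown_admin_queue(ctrl, false);	/* cancels admin commands */
	nvme_unquiesce_admin_queue(ctrl);
	nvme_auth_stop(ctrl);
	nvme_delay_kick_retry_lists(ctrl);	/* only after all cancellations */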

One limitation of my patchset is that it does not yet include full CQT
(Command Quiesce Timeout) support, and due to testing environment
constraints, only nvme_tcp and nvme_rdma are currently covered.

I’d be happy to discuss the pros and cons of both approaches -- perhaps we can
combine the best aspects.

Looking forward to your thoughts.

Thanks,
Jiewei

[1] https://lore.kernel.org/linux-nvme/20250410122054.2526358-1-jiewei@xxxxxxxxxx/