Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

From: Jiewei Ke
Date: Tue Apr 15 2025 - 11:09:19 EST


Hi Daniel,

> On Apr 15, 2025, at 20:34, Daniel Wagner <dwagner@xxxxxxx> wrote:
>
> Hi Jiewei,
>
> On Thu, Apr 10, 2025 at 11:44:13PM +0800, Jiewei Ke wrote:
>> I just noticed that your patchset addresses a similar issue to the one I'm
>> trying to solve with my recently submitted patchset [1]. Compared to your
>> approach, mine differs in a few key aspects:
>>
>> 1. Only aborted requests are delayed for retry. In the current implementation,
>> nvmf_complete_timed_out_request and nvme_cancel_request set the request status
>> to NVME_SC_HOST_ABORTED_CMD. These requests are usually already sent to the
>> target, but may have timed out or been canceled before a response is received.
>> Since the target may still be processing them, the host needs to delay retrying
>> to ensure the target has completed or cleaned up the stale requests. On the
>> other hand, requests that are not aborted — such as those that never got
>> submitted due to no usable path (e.g., from nvme_ns_head_submit_bio), or those
>> that already received an ANA error from the target — do not need
>> delayed retry.
>
> If I understand you correctly, you are concerned about delaying all
> commands, even those which have not been transmitted yet. If there are no
> guarantees on ordering, it would be possible to fail over these commands
> immediately. Sure, something which could be improved in my series.

Yes, your patchset delays all commands. In fact, a distinction needs to be made:
commands that failed with a path error and those that were never sent can be failed
over immediately. You might consider applying the delay only to commands with status
NVME_SC_HOST_ABORTED_CMD.
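
For instance, the check could be as simple as the following (a sketch only;
nvme_req_needs_delayed_failover is a hypothetical helper name, and the real
hook point depends on where your series decides to requeue the commands):

	/*
	 * Only commands aborted by the host (timed out or canceled while
	 * possibly still executing on the target) need the delayed retry.
	 */
	static inline bool nvme_req_needs_delayed_failover(struct request *req)
	{
		return nvme_req(req)->status == NVME_SC_HOST_ABORTED_CMD;
	}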

>> 2. The host explicitly disconnects and stops KeepAlive before scheduling the
>> delayed retry of requests. This aligns with Section 9.6 "Communication Loss
>> Handling" of the NVMe Base Specification 2.1. Once the host disconnects, the
>> target may take up to the KATO interval to detect the lost connection and
>> begin cleaning up any remaining requests. Retrying too early may still lead
>> to data corruption issues.
>
> The host error handler is executed, thus all communication is stopped.
> This part hasn't changed. Not sure what you are referring to. The only
> thing which changes in this series is when we enqueue the failed
> commands again.

In your patch, as shown below, I think nvme_schedule_failover should be called after
nvme_rdma_teardown_admin_queue, for two reasons (see the reordered sketch after the
quoted hunk):

* nvme_rdma_teardown_io_queues calls nvme_cancel_tagset, which adds the in-flight
commands to the ctrl's request_list. So nvme_schedule_failover must run after
nvme_rdma_teardown_io_queues in order to kick them.

* Calling it first breaks the rule of starting the delay timer (a.k.a.
nvme_schedule_failover) only after all communication has stopped.

> @@ -2153,6 +2154,7 @@ static const struct blk_mq_ops nvme_rdma_admin_mq_ops = {
>
> static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
> {
> + nvme_schedule_failover(&ctrl->ctrl);
> nvme_rdma_teardown_io_queues(ctrl, shutdown);
> nvme_quiesce_admin_queue(&ctrl->ctrl);
> nvme_disable_ctrl(&ctrl->ctrl, shutdown);
> nvme_rdma_teardown_admin_queue(ctrl, shutdown);
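
Concretely, I would expect the ordering to look something like this (a sketch
only, reordering the calls from the hunk above):

	static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
	{
		nvme_rdma_teardown_io_queues(ctrl, shutdown);
		nvme_quiesce_admin_queue(&ctrl->ctrl);
		nvme_disable_ctrl(&ctrl->ctrl, shutdown);
		nvme_rdma_teardown_admin_queue(ctrl, shutdown);
		/* all queues torn down, communication has fully stopped */
		nvme_schedule_failover(&ctrl->ctrl);
	}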

>> @@ -2178,6 +2180,7 @@ static void nvme_rdma_shutdown_ctrl(struct nvme_rdma_ctrl *ctrl, bool shutdown)
>> nvme_quiesce_admin_queue(&ctrl->ctrl);
>> nvme_disable_ctrl(&ctrl->ctrl, shutdown);
>> nvme_rdma_teardown_admin_queue(ctrl, shutdown);
>> + nvme_delay_kick_retry_lists(&ctrl->ctrl); <<< delay kick retry
>> after tearing down all queues
>
> Without the kick it hangs. The admin has explicitly removed the ctrl.
>
> As I already said, this is an RFC for the sake of figuring out whether this
> approach is a good idea. We all agreed we should first try to sort this
> out before introducing a new queue. There are many problems which it
> would introduce, like the ones from above: 'why delay commands that were
> never sent?', 'what happens when we have several short
> disconnects/connects?', ...
>
> BTW, there is already a hack in the disconnect/connect state transition.
> Ideally we would solve this in a more generic manner.

I agree that introducing a controller-level queue is necessary; the commands pending
retry should be uniformly delayed until after the controller finishes disconnecting.

Regarding the issue of multiple disconnects and reconnects: in my patchset I avoided
delayed_work and instead used a synchronous msleep approach, which prevents
premature retries.
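
A minimal sketch of that idea, simplified from my patchset [1] (the actual
list handling is omitted; nvme_kick_retry_lists is a hypothetical stand-in
for it here):

	static void nvme_delay_kick_retry_lists(struct nvme_ctrl *ctrl)
	{
		/*
		 * Block until the target has had up to KATO (ctrl->kato is
		 * in seconds) to detect the lost connection and clean up any
		 * stale requests, then requeue the aborted commands.
		 */
		msleep(ctrl->kato * 1000);
		nvme_kick_retry_lists(ctrl);	/* hypothetical helper */
	}

Since this runs synchronously in the teardown path, a later disconnect cannot
race with a timer armed by an earlier one.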

If we use delayed_work, the following issue may arise: during a controller reset,
suppose an inflight command (command 1) is canceled and added to the controller's
request_list, then the connection is torn down and nvme_schedule_failover is
called, so command 1 is scheduled to retry after 30 seconds. The reconnection
then completes, but command 2 encounters an I/O timeout and is also added to the
request_list. At this point the 30-second delay armed for command 1 expires, and
command 2 is kicked along with it, i.e. retried prematurely.
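
As a timeline:

	t0       reset starts; command 1 is canceled -> request_list
	t1       teardown done; nvme_schedule_failover arms a 30s delayed_work
	t2       reconnect completes
	t3       command 2 times out -> request_list
	t1+30s   the delayed_work fires and kicks the whole request_list,
	         retrying command 2 long before its own delay has elapsed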

The current state machine cannot avoid this issue when delayed_work is used.

Thanks,
Jiewei