Re: [RFC PATCH 10/14] nvme-tcp: Use CCR to recover controller that hits an error
From: Sagi Grimberg
Date: Sat Dec 27 2025 - 05:35:37 EST
On 26/11/2025 4:11, Mohamed Khalfella wrote:
An alive nvme controller that hits an error now will move to RECOVERING
state instead of RESETTING state. In RECOVERING state ctrl->err_work
will attempt to use cross-controller recovery to terminate inflight IOs
on the controller. If CCR succeeds, then switch to RESETTING state and
continue error recovery as usual by tearing down controller and attempt
reconnecting to target. If CCR fails, then the behavior of recovery
depends on whether CQT is supported or not. If CQT is supported, switch
to time-based recovery by holding inflight IOs until it is safe for them
to be retried. If CQT is not supported, proceed to retry requests
immediately, as the code currently does.
To support implementing time-based recovery turn ctrl->err_work into
delayed work. Update nvme_tcp_timeout() to not complete inflight IOs
while the controller is in the RECOVERING state.
Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
---
drivers/nvme/host/tcp.c | 52 +++++++++++++++++++++++++++++++++++------
1 file changed, 45 insertions(+), 7 deletions(-)
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 9a96df1a511c..ec9a713490a9 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -193,7 +193,7 @@ struct nvme_tcp_ctrl {
struct sockaddr_storage src_addr;
struct nvme_ctrl ctrl;
- struct work_struct err_work;
+ struct delayed_work err_work;
struct delayed_work connect_work;
struct nvme_tcp_request async_req;
u32 io_queues[HCTX_MAX_TYPES];
@@ -611,11 +611,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
{
- if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
+ if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RECOVERING) &&
+ !nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
This warrants an explanation. It is not clear at all why we should allow two different
transitions to allow error recovery to start...
return;
dev_warn(ctrl->device, "starting error recovery\n");
- queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
+ queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, 0);
}
static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
@@ -2470,12 +2471,48 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
nvme_tcp_reconnect_or_remove(ctrl, ret);
}
+static int nvme_tcp_recover_ctrl(struct nvme_ctrl *ctrl)
+{
+ unsigned long rem;
+
+ if (test_and_clear_bit(NVME_CTRL_RECOVERED, &ctrl->flags)) {
+ dev_info(ctrl->device, "completed time-based recovery\n");
+ goto done;
+ }
This is also not clear, why should we get here when NVME_CTRL_RECOVERED is set?
+
+ rem = nvme_recover_ctrl(ctrl);
+ if (!rem)
+ goto done;
+
+ if (!ctrl->cqt) {
+ dev_info(ctrl->device,
+ "CCR failed, CQT not supported, skip time-based recovery\n");
+ goto done;
+ }
+
+ dev_info(ctrl->device,
+ "CCR failed, switch to time-based recovery, timeout = %ums\n",
+ jiffies_to_msecs(rem));
+ set_bit(NVME_CTRL_RECOVERED, &ctrl->flags);
+ queue_delayed_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work, rem);
+ return -EAGAIN;
I don't think that reusing the same work to handle two completely different things
is the right approach here.
How about splitting to fence_work and err_work? That should eliminate some of the
ctrl state inspections and simplify error recovery.
+
+done:
+ nvme_end_ctrl_recovery(ctrl);
+ return 0;
+}
+
static void nvme_tcp_error_recovery_work(struct work_struct *work)
{
- struct nvme_tcp_ctrl *tcp_ctrl = container_of(work,
+ struct nvme_tcp_ctrl *tcp_ctrl = container_of(to_delayed_work(work),
struct nvme_tcp_ctrl, err_work);
struct nvme_ctrl *ctrl = &tcp_ctrl->ctrl;
+ if (nvme_ctrl_state(ctrl) == NVME_CTRL_RECOVERING) {
+ if (nvme_tcp_recover_ctrl(ctrl))
+ return;
+ }
+
Yea, I think we want to rework the current design.