Re: [PATCH v4 10/15] nvme-tcp: Use CCR to recover controller that hits an error

From: Randy Jennings

Date: Fri May 15 2026 - 18:51:51 EST


On Fri, Mar 27, 2026 at 5:46 PM Mohamed Khalfella
<mkhalfella@xxxxxxxxxxxxxxx> wrote:
>
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. Regardless of the success or failure of CCR
> operation the controller is transitioned to RESETTING state to continue
> error recovery process.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
> @@ -612,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> + dev_warn(ctrl->device, "starting controller fencing\n");
> + queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> + return;
> + }
> +
> if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> return;
dev_warn(ctrl->device, "starting error recovery\n");
queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
}


> @@ -2471,12 +2478,29 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> +static void nvme_tcp_fencing_work(struct work_struct *work)
> +{

> + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> +}

There is an inconsistency in logging there for starting error recovery.
I'd rather be able to grep logs and find "starting error recovery" even
if we entered the FENCING state (and there is the race condition related
to Hannes's question were the log could get printed if we had a timeout
at just the right time.

As such, I think we should
dev_warn(ctrl->device, "starting error recovery\n");
before queuing the work.

Sincerely,
Randy Jennings