Re: [PATCH v4 10/15] nvme-tcp: Use CCR to recover controller that hits an error

Next message: Shuah Khan: "Re: [PATCH 6.12 000/144] 6.12.90-rc1 review"
Previous message: Jason Gunthorpe: "Re: [PATCH v4 04/13] dma: swiotlb: track pool encryption state and honor DMA_ATTR_CC_SHARED"
Next in thread: Randy Jennings: "Re: [PATCH v4 10/15] nvme-tcp: Use CCR to recover controller that hits an error"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

From: Randy Jennings

Date: Fri May 15 2026 - 18:51:51 EST

On Fri, Mar 27, 2026 at 5:46 PM Mohamed Khalfella
<mkhalfella@xxxxxxxxxxxxxxx> wrote:
>
> An alive nvme controller that hits an error now will move to FENCING
> state instead of RESETTING state. ctrl->fencing_work attempts CCR to
> terminate inflight IOs. Regardless of the success or failure of CCR
> operation the controller is transitioned to RESETTING state to continue
> error recovery process.
>
> Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
> @@ -612,6 +613,12 @@ static void nvme_tcp_init_recv_ctx(struct nvme_tcp_queue *queue)
>
> static void nvme_tcp_error_recovery(struct nvme_ctrl *ctrl)
> {
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCING)) {
> + dev_warn(ctrl->device, "starting controller fencing\n");
> + queue_work(nvme_wq, &to_tcp_ctrl(ctrl)->fencing_work);
> + return;
> + }
> +
> if (!nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> return;
dev_warn(ctrl->device, "starting error recovery\n");
queue_work(nvme_reset_wq, &to_tcp_ctrl(ctrl)->err_work);
}

> @@ -2471,12 +2478,29 @@ static void nvme_tcp_reconnect_ctrl_work(struct work_struct *work)
> +static void nvme_tcp_fencing_work(struct work_struct *work)
> +{

> + nvme_change_ctrl_state(ctrl, NVME_CTRL_FENCED);
> + if (nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING))
> + queue_work(nvme_reset_wq, &tcp_ctrl->err_work);
> +}

There is an inconsistency in logging there for starting error recovery.
I'd rather be able to grep logs and find "starting error recovery" even
if we entered the FENCING state (and there is the race condition related
to Hannes's question were the log could get printed if we had a timeout
at just the right time.

As such, I think we should
dev_warn(ctrl->device, "starting error recovery\n");
before queuing the work.

Sincerely,
Randy Jennings