Re: [PATCH v3 08/21] nvme: Implement cross-controller reset recovery

From: Hannes Reinecke

Date: Mon Feb 16 2026 - 07:41:52 EST

On 2/14/26 05:25, Mohamed Khalfella wrote:

A host that has more than one path connecting to an nvme subsystem
typically has an nvme controller associated with every path. This is
mostly applicable to nvmeof. If one path goes down, inflight IOs on that
path should not be retried immediately on another path because this
could lead to data corruption as described in TP4129. TP8028 defines a
cross-controller reset mechanism that the host can use to terminate
IOs on the failed path using one of the remaining healthy paths. Only
after the IOs are terminated, or enough time has passed as defined by
TP4129, should inflight IOs be retried on another path. Implement the
shared core cross-controller reset logic to be used by the transports.

Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
---
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 141 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 9 +++
3 files changed, 151 insertions(+)

diff --git a/drivers/nvme/host/constants.c b/drivers/nvme/host/constants.c
index dc90df9e13a2..f679efd5110e 100644
--- a/drivers/nvme/host/constants.c
+++ b/drivers/nvme/host/constants.c
@@ -46,6 +46,7 @@ static const char * const nvme_admin_ops[] = {
[nvme_admin_virtual_mgmt] = "Virtual Management",
[nvme_admin_nvme_mi_send] = "NVMe Send MI",
[nvme_admin_nvme_mi_recv] = "NVMe Receive MI",
+ [nvme_admin_cross_ctrl_reset] = "Cross Controller Reset",
[nvme_admin_dbbuf] = "Doorbell Buffer Config",
[nvme_admin_format_nvm] = "Format NVM",
[nvme_admin_security_send] = "Security Send",
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 231d402e9bfb..765b1524b3ed 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -554,6 +554,146 @@ void nvme_cancel_admin_tagset(struct nvme_ctrl *ctrl)
}
EXPORT_SYMBOL_GPL(nvme_cancel_admin_tagset);
+static struct nvme_ctrl *nvme_find_ctrl_ccr(struct nvme_ctrl *ictrl,
+ u32 min_cntlid)
+{
+ struct nvme_subsystem *subsys = ictrl->subsys;
+ struct nvme_ctrl *ctrl, *sctrl = NULL;
+ unsigned long flags;
+
+ mutex_lock(&nvme_subsystems_lock);
+ list_for_each_entry(ctrl, &subsys->ctrls, subsys_entry) {
+ if (ctrl->cntlid < min_cntlid)
+ continue;
+
+ if (atomic_dec_if_positive(&ctrl->ccr_limit) < 0)
+ continue;
+
+ spin_lock_irqsave(&ctrl->lock, flags);
+ if (ctrl->state != NVME_CTRL_LIVE) {
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ atomic_inc(&ctrl->ccr_limit);
+ continue;
+ }
+
+ /*
+ * We got a good candidate source controller that is locked and
+ * LIVE. However, no guarantee ctrl will not be deleted after
+ * ctrl->lock is released. Get a ref of both ctrl and admin_q
+ * so they do not disappear until we are done with them.
+ */
+ WARN_ON_ONCE(!blk_get_queue(ctrl->admin_q));
+ nvme_get_ctrl(ctrl);
+ spin_unlock_irqrestore(&ctrl->lock, flags);
+ sctrl = ctrl;
+ break;
+ }
+ mutex_unlock(&nvme_subsystems_lock);
+ return sctrl;
+}
+
+static void nvme_put_ctrl_ccr(struct nvme_ctrl *sctrl)
+{
+ atomic_inc(&sctrl->ccr_limit);
+ blk_put_queue(sctrl->admin_q);
+ nvme_put_ctrl(sctrl);
+}
+
+static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
+{
+ struct nvme_ccr_entry ccr = { };
+ union nvme_result res = { 0 };
+ struct nvme_command c = { };
+ unsigned long flags, tmo;
+ bool completed = false;
+ int ret = 0;
+ u32 result;
+
+ init_completion(&ccr.complete);
+ ccr.ictrl = ictrl;
+
+ spin_lock_irqsave(&sctrl->lock, flags);
+ list_add_tail(&ccr.list, &sctrl->ccr_list);
+ spin_unlock_irqrestore(&sctrl->lock, flags);
+
+ c.ccr.opcode = nvme_admin_cross_ctrl_reset;
+ c.ccr.ciu = ictrl->ciu;
+ c.ccr.icid = cpu_to_le16(ictrl->cntlid);
+ c.ccr.cirn = cpu_to_le64(ictrl->cirn);
+ ret = __nvme_submit_sync_cmd(sctrl->admin_q, &c, &res,
+ NULL, 0, NVME_QID_ANY, 0);
+ if (ret) {
+ ret = -EIO;
+ goto out;
+ }
+
+ result = le32_to_cpu(res.u32);
+ if (result & 0x01) /* Immediate Reset Successful */
+ goto out;
+
+ tmo = secs_to_jiffies(ictrl->kato);
+ if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
+ ret = -ETIMEDOUT;
+ goto out;
+ }
+

That will be tricky. The 'ccr' command will be sent with the default
command queue timeout, which is decoupled from KATO.
So you really should set the command timeout for the 'ccr' command
to ctrl->kato to ensure it'll be terminated correctly.
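
To make the mismatch concrete, here is a rough userspace model of the
two sequential waits in the quoted function. It is only a sketch, not
kernel code: the helper names are hypothetical, and the 60 second admin
command timeout and 5 second KATO are assumed values for illustration,
not figures taken from the patch. It also reflects just one reading of
the concern, namely that the synchronous submit can block far longer
than KATO before the completion wait even starts.

```c
#include <stdbool.h>

/*
 * Assumed values for illustration only (not taken from the patch): a
 * 60 second default admin command timeout and a 5 second KATO.
 */
#define ASSUMED_ADMIN_TIMEOUT_SECS 60u
#define ASSUMED_KATO_SECS 5u

/*
 * Hypothetical model of the two sequential waits in
 * nvme_issue_wait_ccr(): the synchronous submit may block for up to
 * cmd_timeout_secs, and the wait for the completion may then block
 * for up to kato_secs.
 */
static inline unsigned int ccr_worst_case_wait_secs(unsigned int cmd_timeout_secs,
						    unsigned int kato_secs)
{
	return cmd_timeout_secs + kato_secs;
}

/* True when the command timeout is tied to KATO, as suggested above. */
static inline bool ccr_timeout_follows_kato(unsigned int cmd_timeout_secs,
					    unsigned int kato_secs)
{
	return cmd_timeout_secs == kato_secs;
}
```

With the assumed values, the model gives a worst case of 60 + 5 = 65
seconds when the command uses the queue default, against 5 + 5 = 10
seconds when the command timeout is set to KATO.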

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich