Re: [PATCH v4 08/15] nvme: Implement cross-controller reset recovery
From: Hannes Reinecke
Date: Tue Apr 07 2026 - 01:39:23 EST
On 3/31/26 18:47, Mohamed Khalfella wrote:
On Mon 2026-03-30 12:50:24 +0200, Hannes Reinecke wrote:
[ .. ]
On 3/28/26 01:43, Mohamed Khalfella wrote:
A host that has more than one path to an nvme subsystem typically has
an nvme controller associated with each path. This mostly applies to
nvmeof. If one path goes down, inflight IOs on that path should not be
retried immediately on another path because this could lead to data
corruption, as described in TP4129. TP8028 defines a cross-controller
reset mechanism that the host can use to terminate IOs on the failed
path through one of the remaining healthy paths. Inflight IOs should be
retried on another path only after they are terminated, or after enough
time has passed as defined by TP4129. Implement the core
cross-controller reset logic shared by the transports.
Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
---
drivers/nvme/host/constants.c | 1 +
drivers/nvme/host/core.c | 145 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/nvme.h | 9 +++
3 files changed, 155 insertions(+)
Yes, thank you.
+
+int nvme_fence_ctrl(struct nvme_ctrl *ictrl)
+{
+	unsigned long deadline, timeout;
+	struct nvme_ctrl *sctrl;
+	u32 min_cntlid = 0;
+	int ret;
+
+	timeout = nvme_fence_timeout_ms(ictrl);
+	dev_info(ictrl->device, "attempting CCR, timeout %lums\n", timeout);
+
+	deadline = jiffies + msecs_to_jiffies(timeout);
+	while (time_is_after_jiffies(deadline)) {
+		sctrl = nvme_find_ctrl_ccr(ictrl, min_cntlid);
+		if (!sctrl) {
+			dev_dbg(ictrl->device,
+				"failed to find source controller\n");
+			return -EIO;
+		}
+
+		ret = nvme_issue_wait_ccr(sctrl, ictrl, deadline);
+		if (!ret) {
+			dev_info(ictrl->device, "CCR succeeded using %s\n",
+				 dev_name(sctrl->device));
+			nvme_put_ctrl_ccr(sctrl);
+			return 0;
+		}
+
+		min_cntlid = sctrl->cntlid + 1;
+		nvme_put_ctrl_ccr(sctrl);
+
+		if (ret == -EIO) /* CCR command failed */
+			continue;
+
+		/* CCR operation failed or timed out */
+		return ret;
+	}
+
+	dev_info(ictrl->device, "CCR operation timeout\n");
+	return -ETIMEDOUT;
+}
Please restructure the loop.
Having a comment 'CCR operation failed or timed out',
returning a status, and then having a message
'CCR operation timeout' _after_ the return is confusing.
I can change /* CCR operation failed or timed out */ to something like
/*
* Source controller accepted CCR command but CCR operation
* timed out or failed. Retrying another path is not likely
* to succeed, return an error.
*/
And change the log line "CCR operation timeout\n" outside the while
loop to "fencing timedout\n".
Will this help?
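For reference, here is a minimal userspace sketch of how the loop's control
flow would read with the proposed comment and log message. It is a model, not
the patch: struct model_ctrl, model_find_ctrl() and model_fence_ctrl() are
hypothetical stand-ins, the jiffies deadline is replaced by a plain attempt
budget, and the candidate array is assumed to be sorted by cntlid.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a source controller, not struct nvme_ctrl. */
struct model_ctrl {
	unsigned int cntlid;
	int ccr_result;	/* 0, -EIO (CCR command failed), or another -errno */
};

/*
 * Return the first candidate with cntlid >= min_cntlid, or NULL.
 * Assumption: the array is sorted by cntlid.
 */
static const struct model_ctrl *
model_find_ctrl(const struct model_ctrl *ctrls, size_t nr,
		unsigned int min_cntlid)
{
	assert(ctrls || nr == 0);

	for (size_t i = 0; i < nr; i++)
		if (ctrls[i].cntlid >= min_cntlid)
			return &ctrls[i];
	return NULL;
}

/*
 * Model of the fencing loop. 'budget' stands in for the deadline:
 * every attempt consumes one unit.
 */
static int model_fence_ctrl(const struct model_ctrl *ctrls, size_t nr,
			    unsigned int budget)
{
	unsigned int min_cntlid = 0;

	for (; budget > 0; budget--) {
		const struct model_ctrl *sctrl;
		int ret;

		sctrl = model_find_ctrl(ctrls, nr, min_cntlid);
		if (!sctrl)
			return -EIO;

		ret = sctrl->ccr_result;
		if (!ret)
			return 0;

		min_cntlid = sctrl->cntlid + 1;

		if (ret == -EIO)	/* CCR command failed, try next path */
			continue;

		/*
		 * Source controller accepted CCR command but CCR operation
		 * timed out or failed. Retrying another path is not likely
		 * to succeed, return an error.
		 */
		return ret;
	}

	fprintf(stderr, "fencing timedout\n");
	return -ETIMEDOUT;
}
```

In this model the three exits stay distinct: -EIO when no source controller
is left to try, the CCR operation's own error when a source controller
accepted the command but the operation did not complete, and -ETIMEDOUT only
when the overall fencing budget runs out.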
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich