Re: [PATCH v3 18/21] nvme: Update CCR completion wait timeout to consider CQT

From: Randy Jennings

Date: Thu Feb 19 2026 - 21:02:19 EST


Hannes,

> (ctrl->kato * 1000) + ctrl->cqt
As Mohamed pointed out, we have already received a response from a CCR
command. The CCR, once accepted, communicates the death of the
connection to the impacted controller and starts the cleanup tracked
by CQT. So, no need to wait for the impacted controller to figure out
the connection is down.

The max(cqt, kato) was just to give some wait time that should allow
issuing a CCR again from a different controller (in case of losing
communication with this one). It certainly does not need to be longer
than cqt (and it should be no longer than the remaining duration of
time-based retry; that should get addressed at some point). I cannot
remember why kato (if larger; I expect it would be smaller) made sense
at the time.
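
To make the candidates in this thread concrete, here is a rough userspace
sketch of the three wait-time policies discussed (max, sum, and cqt capped
by remaining retry time). It is only an illustration: the helper names and
the retry_left_ms parameter are hypothetical and do not exist in the driver.
It assumes kato is in seconds and cqt is in milliseconds, as the patch's
"ictrl->kato * 1000" implies.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not kernel APIs. */

/* Policy in the patch: max(cqt, kato), both expressed in milliseconds. */
static uint64_t ccr_wait_max_ms(uint32_t kato_s, uint32_t cqt_ms)
{
	uint64_t kato_ms = (uint64_t)kato_s * 1000;

	return kato_ms > cqt_ms ? kato_ms : cqt_ms;
}

/* Policy Hannes suggested: a full KATO interval followed by CQT. */
static uint64_t ccr_wait_sum_ms(uint32_t kato_s, uint32_t cqt_ms)
{
	return (uint64_t)kato_s * 1000 + cqt_ms;
}

/*
 * Policy described above: no longer than cqt, and no longer than what
 * is left of time-based retry (retry_left_ms is an assumed input).
 */
static uint64_t ccr_wait_capped_ms(uint32_t cqt_ms, uint64_t retry_left_ms)
{
	return cqt_ms < retry_left_ms ? cqt_ms : retry_left_ms;
}
```

With example values kato = 5 s and cqt = 8000 ms, the max policy waits
8000 ms and the sum policy waits 13000 ms; with 6000 ms of retry time left,
the capped policy waits 6000 ms.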

Sincerely,
Randy Jennings

On Tue, Feb 17, 2026 at 7:35 AM Mohamed Khalfella
<mkhalfella@xxxxxxxxxxxxxxx> wrote:
>
> On Tue 2026-02-17 08:09:33 +0100, Hannes Reinecke wrote:
> > On 2/16/26 19:45, Mohamed Khalfella wrote:
> > > On Mon 2026-02-16 13:54:18 +0100, Hannes Reinecke wrote:
> > >> On 2/14/26 05:25, Mohamed Khalfella wrote:
> > >>> TP8028 Rapid Path Failure Recovery does not define how much time the
> > >>> host should wait for CCR operation to complete. It is reasonable to
> > >>> assume that CCR operation can take up to ctrl->cqt. Update wait time for
> > >>> CCR operation to be max(ctrl->cqt, ctrl->kato).
> > >>>
> > >>> Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
> > >>> ---
> > >>> drivers/nvme/host/core.c | 2 +-
> > >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> > >>> index 0680d05900c1..ff479c0263ab 100644
> > >>> --- a/drivers/nvme/host/core.c
> > >>> +++ b/drivers/nvme/host/core.c
> > >>> @@ -631,7 +631,7 @@ static int nvme_issue_wait_ccr(struct nvme_ctrl *sctrl, struct nvme_ctrl *ictrl)
> > >>>  	if (result & 0x01) /* Immediate Reset Successful */
> > >>>  		goto out;
> > >>>
> > >>> -	tmo = secs_to_jiffies(ictrl->kato);
> > >>> +	tmo = msecs_to_jiffies(max(ictrl->cqt, ictrl->kato * 1000));
> > >>>  	if (!wait_for_completion_timeout(&ccr.complete, tmo)) {
> > >>>  		ret = -ETIMEDOUT;
> > >>>  		goto out;
> > >>
> > >> That is not my understanding. I was under the impression that CQT is the
> > >> _additional_ time a controller requires to clear out outstanding
> > >> commands once it detected a loss of communication (ie _after_ KATO).
> > >> Which would mean we have to wait for up to
> > >> (ctrl->kato * 1000) + ctrl->cqt.
> > >
> > > At this point the source controller knows about communication loss. We
> > > do not need kato wait. In theory we should just wait for CQT.
> > > max(cqt, kato) is a conservative guess I made.
> > >
> > Not quite. The source controller (on the host!) knows about the
> > communication loss. But the target might not, as the keep-alive
> > command might have arrived at the target _just_ before KATO
> > triggered on the host. So the target is still good, and will
> > be waiting for _another_ KATO interval before declaring
> > a loss of communication.
> > And only then will the CQT period start at the target.
> >
> > Randy, please correct me if I'm wrong ...
> >
>
> wait_for_completion_timeout(&ccr.complete, tmo) waits for the CCR operation
> to complete. The wait starts after the CCR command completed successfully.
> IOW, it starts after the host received a CQE from the source controller on
> the target telling us all is good. If the source controller on the target
> already knows about the loss of communication then there is no need to wait
> for KATO. We just need to wait for the CCR operation to finish because we
> know it has been started successfully.
>
> The spec does not tell us how much time to wait for the CCR operation to
> complete. max(cqt, kato) is an estimate I think is reasonable to make.