Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

From: Randy Jennings
Date: Tue Apr 15 2025 - 18:56:35 EST


On Tue, Apr 15, 2025 at 5:17 AM Daniel Wagner <dwagner@xxxxxxx> wrote:
> On Thu, Apr 10, 2025 at 01:51:37AM -0700, Mohamed Khalfella wrote:
> > > +void nvme_schedule_failover(struct nvme_ctrl *ctrl)
> > > + if (ctrl->cqt)
> > > + delay = msecs_to_jiffies(ctrl->cqt);
> > > + else
> > > + delay = ctrl->kato * HZ;
> >
> > I thought that delay = m * ctrl->kato + ctrl->cqt
> > where m = ctrl->ctratt & NVME_CTRL_ATTR_TBKAS ? 3 : 2
> > no?
>
> The failover schedule delay is the additional amount of time we have to
> wait for the target to cleanup (CQT). If the CTQ is not valid I thought
> the spec said to wait for a KATO. Possible I got that wrong.
This is correct (according to the spec, if CQT is 0, wait for KATO
instead of 0). I would treat that as a suggestion, though.

> The factor 3 or 2 is relavant for the timeout value for the KATO command
> we schedule. The failover schedule timeout is ontop of the command
> timeout value.
3 or 2 is not related to the timeout value of the KATO command. The
timeout value of the KATO command is whatever the host wants it to be
(with some values being more productive than others). The target is
supposed to respond to the KATO command as soon as the target receives
it (roughly). The host timeout value should account for network
delays and getting-around-to-it delays on the target.

2*/3*KATO is from when the host has detected a loss of communication
to when the host knows (given some assumptions) that the target has
detected a loss of communication. A command timeout on the host is a
fine time for the host to decide that it has lost communication with
the target, but there are other events. At the time the host has
detected a loss of communication, it needs to tear down the
association (which includes stopping KATO & starting disconnect on
_all_ the connections in the association). CQT does not start until
the host knows that the target has detected a loss of communication.
So, Mohamed's delay is correct.

> > - What about requests that do not go through nvme_failover_req(), like
> > passthrough requests, do we not want to hold these requests until it
> > is safe for them to be retried?
>
> Pasthrough commands should fail immediately. Userland is in charge here,
> not the kernel. At least this what should happen here.
This is not correct according to the spec, and, I believe, not a good
implementation choice. The driver (in the spec) is instructed _not_
to return an error for any request until it knows (given some
assumptions) that the target is no longer processing the request (or
that processing does not matter; as in the case of a READ).

Sincerely,
Randy Jennings