Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

From: Randy Jennings
Date: Tue Apr 15 2025 - 19:03:37 EST


On Tue, Apr 15, 2025 at 2:07 PM Sagi Grimberg <sagi@xxxxxxxxxxx> wrote:
> On 15/04/2025 15:11, Daniel Wagner wrote:
> > On Tue, Apr 15, 2025 at 01:28:15AM +0300, Sagi Grimberg wrote:
> >>>> +void nvme_schedule_failover(struct nvme_ctrl *ctrl)
> >>>> +{
> >>>> + unsigned long delay;
> >>>> +
> >>>> + if (ctrl->cqt)
> >>>> + delay = msecs_to_jiffies(ctrl->cqt);
> >>>> + else
> >>>> + delay = ctrl->kato * HZ;
> >>> I thought that delay = m * ctrl->kato + ctrl->cqt
> >>> where m = ctrl->ctratt & NVME_CTRL_ATTR_TBKAS ? 3 : 2
> >>> no?
> >> This was said before, but if we are going to always start waiting for kato
> >> for failover purposes,
> >> we first need a patch that prevent kato from being arbitrarily long.
> > That should be addressed with the cross controller reset (CCR).
>
> CCR is a better solution as it is explicit, and faster.
>
> > The KATO*n
> > + CQT is the upper limit for the target recovery. As soon we have CCR,
> > the recovery delay is reduced to the time the CCR exchange takes.
>
> What I meant was that the user can no longer set kato to be arbitrarily
> long when we
> now introduce failover dependency on it.
>
> We need to set a sane maximum value that will failover in a reasonable
> timeframe.
> In other words, kato cannot be allowed to be set by the user to 60
> minutes. While we didn't
> care about it before, now it means that failover may take 60+ minutes.
>
> Hence, my request to set kato to a max absolute value of seconds. My
> vote was 10 (2x of the default),
> but we can also go with 30.
Adding a maximum value for KATO makes a lot of sense to me. This will
help keep us away from a hung task timeout when the full delay is
taken into account. 30 makes sense to me from the perspective that
the maximum should be long enough to handle non-ideal situations
functionally, but not a value that you expect people to use regularly.

I think CQT should have a maximum allowed value for similar reasons.
If we do clamp down on the CQT, we could be opening ourselves to the
target not completely cleaning up, but it keeps us from a hung task
timeout, and _any_ delay will help most of the time.

CCR will not address arbitrarily long times for either because:
1. It is optional.
2. It may fail.
3. We still need a ceiling on the recovery time we can handle.

Sincerely,
Randy Jennings