Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout

From: Sagi Grimberg
Date: Tue Apr 15 2025 - 17:07:29 EST

Next message: Reinette Chatre: "Re: [PATCH v8 03/21] x86/resctrl: Rename resctrl_sched_in() to begin with "resctrl_arch_""
Previous message: Reinette Chatre: "Re: [PATCH v8 02/21] x86/resctrl: Remove the limit on the number of CLOSID"
In reply to: Daniel Wagner: "Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout"
Next in thread: Randy Jennings: "Re: [PATCH RFC 3/3] nvme: delay failover by command quiesce timeout"

On 15/04/2025 15:11, Daniel Wagner wrote:

On Tue, Apr 15, 2025 at 01:28:15AM +0300, Sagi Grimberg wrote:

+void nvme_schedule_failover(struct nvme_ctrl *ctrl)
+{
+	unsigned long delay;
+
+	if (ctrl->cqt)
+		delay = msecs_to_jiffies(ctrl->cqt);
+	else
+		delay = ctrl->kato * HZ;

I thought that delay = m * ctrl->kato + ctrl->cqt
where m = ctrl->ctratt & NVME_CTRL_ATTR_TBKAS ? 3 : 2
no?

This was said before, but if we are going to always start waiting for kato
for failover purposes, we first need a patch that prevents kato from being
arbitrarily long.

That should be addressed with the cross controller reset (CCR).

CCR is a better solution as it is explicit, and faster.

The KATO*n + CQT is the upper limit for the target recovery. As soon as we
have CCR, the recovery delay is reduced to the time the CCR exchange takes.
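
To make the upper bound discussed above concrete, here is a minimal userspace sketch of delay = m * kato + cqt. It assumes the same units as the quoted patch: kato in seconds (the patch multiplies it by HZ) and cqt in milliseconds (the patch passes it to msecs_to_jiffies()). The helper name and the SKETCH_CTRATT_TBKAS macro are hypothetical stand-ins, and the bit position is an assumption; this is not the proposed kernel code.

```c
#include <stdint.h>

/* Local stand-in for NVME_CTRL_ATTR_TBKAS; the bit position is assumed here. */
#define SKETCH_CTRATT_TBKAS (1u << 6)

/*
 * Worst-case failover delay in milliseconds, following the formula from
 * the thread: m * kato + cqt, with m = 3 when TBKAS is set and 2 otherwise.
 * kato_s is in seconds, cqt_ms is in milliseconds.
 */
static uint64_t sketch_failover_delay_ms(uint32_t ctratt, uint32_t kato_s,
					 uint32_t cqt_ms)
{
	uint64_t m = (ctratt & SKETCH_CTRATT_TBKAS) ? 3 : 2;

	return m * kato_s * 1000ULL + cqt_ms;
}
```

With a kato of 5 seconds, no TBKAS and no CQT, this gives 10000 ms. With a kato of 3600 seconds (60 minutes) it gives 7200000 ms, i.e. two hours before failover.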

What I meant was that the user can no longer be allowed to set kato arbitrarily
long once we introduce a failover dependency on it.

We need to set a sane maximum value so that failover happens in a reasonable timeframe.
In other words, kato cannot be allowed to be set by the user to 60 minutes. While we didn't
care about it before, now it means that failover may take 60+ minutes.

Hence my request to cap kato at an absolute maximum, in seconds. My vote was 10 (2x the default),
but we can also go with 30.
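
A minimal sketch of what such a cap could look like, in plain userspace C. The macro and function names are hypothetical, the limit of 10 seconds is just the value voted for above (30 was the alternative), and where the clamp would actually live in the driver is left open.

```c
#include <stdint.h>

/* Hypothetical cap in seconds; the thread suggests 10, with 30 as an option. */
#define SKETCH_KATO_MAX_S 10u

/*
 * Clamp a user-requested kato (in seconds) to the cap.
 * Values at or below the cap, including 0, pass through unchanged.
 */
static uint32_t sketch_clamp_kato(uint32_t requested_s)
{
	return requested_s > SKETCH_KATO_MAX_S ? SKETCH_KATO_MAX_S : requested_s;
}
```

Under this sketch a request for 3600 seconds is reduced to 10, so a kato-based failover delay stays in the range of tens of seconds rather than hours.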
