Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion

From: Randy Jennings

Date: Thu Feb 19 2026 - 22:34:50 EST


On Wed, Feb 18, 2026 at 4:47 AM Mohamed Khalfella
<mkhalfella@xxxxxxxxxxxxxxx> wrote:
>
> On Wed 2026-02-18 08:51:31 +0100, Hannes Reinecke wrote:
> > On 2/17/26 19:25, Mohamed Khalfella wrote:
> At the same time the current implementation only waits for 1*KATO before
> retrying, so there will be regression if we switch to TP8028-compliant
> KATO handling for systems not supporting CCR.
Hannes, as I read the code (this is patch 19), if CQT is not set, there
is no delay. I was expecting that to continue forward (I would be happy
to exclude '1' also). I agree that we would not want to use CQT where
subsystems have not requested that time to quiesce.

Or am I reading this wrong, and your worry is that committed code
currently waits for 1*KATO and this patch set shortens that? I do not
see a delay of 1*KATO in committed code. What am I missing?

> > > On Mon 2026-02-16 13:43:51 +0100, Hannes Reinecke wrote:

> > So we can (and should) use CCR as the determining factor whether we
> > want to switch to TP8028-compliant behaviour or stick with the original
> > implementation.
>
> We do check CCR support and availability in nvme_find_ctrl_ccr(). Adding
> a second counter will spare us the loop in nvme_find_ctrl_ccr(), which
> is not worth it IMO.

Another option is the Commands Supported log page. CCR is a command,
so support for it should show up there. The data structure is not the
simplest to reference; it might end up more complicated than having a
separate flag (why use another counter?).
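
To give a feel for what that lookup involves, here is a minimal userspace
sketch, not code from the patch set. It assumes the Commands Supported and
Effects log layout of 256 admin entries followed by 256 I/O entries, 4 bytes
each, with bit 0 (CSUPP) meaning "command supported". The names effects_log,
le32_decode() and admin_cmd_supported() are made up for illustration; in the
kernel the counterparts would be struct nvme_effects_log and
NVME_CMD_EFFECTS_CSUPP. The CCR opcode is deliberately left as a parameter
rather than hard-coded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit 0 (CSUPP) of each log entry: the command is supported. */
#define EFFECTS_CSUPP 0x1u

/*
 * Hypothetical userspace mirror of the Commands Supported and Effects
 * log: 256 admin entries, then 256 I/O entries, 4 bytes each,
 * little-endian. Entries are kept as raw bytes so that decoding does
 * not depend on host endianness.
 */
struct effects_log {
	uint8_t acs[256][4];	/* admin command entries, indexed by opcode */
	uint8_t iocs[256][4];	/* I/O command entries, indexed by opcode */
	uint8_t rsvd[2048];
};

/* Decode one little-endian 32-bit entry. */
static uint32_t le32_decode(const uint8_t b[4])
{
	return (uint32_t)b[0] |
	       ((uint32_t)b[1] << 8) |
	       ((uint32_t)b[2] << 16) |
	       ((uint32_t)b[3] << 24);
}

/* True if the admin command with this opcode has CSUPP set. */
static bool admin_cmd_supported(const struct effects_log *log, uint8_t opcode)
{
	assert(log != NULL);
	return (le32_decode(log->acs[opcode]) & EFFECTS_CSUPP) != 0;
}
```

The check itself is one array index and one bit test; the cost is in
fetching and caching the log page per controller, which is where a plain
flag may still come out simpler.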

RE:
> > want to switch to TP8028-compliant behaviour or stick with the original
> > implementation.
Hannes, do you mean TP8028 or TP4129? Yes, if we do not support CCRs
we should not send them or expect to receive a successful response.

I would be careful of stating this in terms of TP-compliant behavior. I
care about fixing a data corruption. TP4129 worked out what that
required and provided a channel to communicate how long the
subsystem took to clean up, but I really do not care much about
compliance outside of compatibility and predictability. As long as the
data corruption is handled conclusively and in a feasible manner
(IOW, no, the subsystem cannot clean up instantaneously, and we
do have to deal with possible communication delays while
coordinating between the host and subsystem), I can be happy
with the solution.

Sincerely,
Randy Jennings