Re: [PATCH v3 09/21] nvme: Implement cross-controller reset completion

From: Mohamed Khalfella

Date: Wed Feb 18 2026 - 07:47:21 EST


On Wed 2026-02-18 08:51:31 +0100, Hannes Reinecke wrote:
> On 2/17/26 19:25, Mohamed Khalfella wrote:
> > On Mon 2026-02-16 13:43:51 +0100, Hannes Reinecke wrote:
> [ .. ]
> >>
> >> We really would need some indicator whether 'ccr' is supported at all.
> >
> > Why do we need this indicator, other than exporting it via sysfs?
> >
> To avoid false positives.

We will never try CCR on a controller that does not support it. False
positive of what?

>
> >> Using the number of available CCR commands would be an option, though
> >> that would require us to keep two counters (one for the number of
> >> possible outstanding CCRs, and one for the number of actual outstanding
> >> CCRs).
> >
> > As mentioned above, ctrl->ccr_limit gives us the number of CCRs
> > available now. It is not a 100% indicator of whether CCR is supported,
> > but it is enough to implement CCR. A second counter can help us skip
> > trying CCR if we know the impacted controller does not support it.
> >
> > Do you think it is worth it?
> >
> Yes. The problem is that we want to get towards TP8028 compliance, which
> forces us to wait for 2*KATO + CQT before requests on the failed path
> can be retried. That will cause a _noticeable_ stall on the application
> side. And the only way to shorten that is CCR; once we get confirmation
> from CCR we can start retrying immediately.
> At the same time the current implementation only waits for 1*KATO before
> retrying, so there will be a regression if we switch to TP8028-compliant
> KATO handling for systems not supporting CCR.

The statement above is not correct. Careful consideration and testing
went into making sure no such regression is introduced. If CCR is not
supported, nvme_find_ctrl_ccr() returns NULL and nvme_fence_ctrl()
returns immediately. No CCR command is sent and there is no wait for an
AEN.

What happens next depends on whether ictrl->cqt is supported. If it is
not, which is the case for systems in the field today, requests are
retried immediately. Requests are not held in this case and no delay is
seen on failover.
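
To make the flow described above concrete, here is a minimal userspace
sketch. It is not the patch code. The struct, the enum and the model_*
names are hypothetical, and treating cqt == 0 as "CQT not supported" is
an assumption made only for this illustration.

```c
#include <stddef.h>

/* Hypothetical, simplified state; not the kernel's struct nvme_ctrl. */
struct model_ctrl {
	unsigned int ccr_limit;	/* CCR commands available right now */
	unsigned int cqt;	/* 0 models "CQT not supported" (assumption) */
};

enum model_step {
	MODEL_TRY_CCR,		/* a source controller exists, CCR is attempted */
	MODEL_RETRY_NOW,	/* no CCR, no CQT: retry immediately */
	MODEL_HOLD_REQUESTS,	/* no CCR, but CQT supported: requests are held */
};

/* Stand-in for nvme_find_ctrl_ccr(): first peer with a CCR slot, else NULL. */
static struct model_ctrl *model_find_ctrl_ccr(struct model_ctrl *peers, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (peers[i].ccr_limit > 0)
			return &peers[i];
	return NULL;
}

/* Decide what happens to requests of the impacted controller (ictrl). */
static enum model_step model_next_step(const struct model_ctrl *ictrl,
				       struct model_ctrl *peers, size_t n)
{
	if (model_find_ctrl_ccr(peers, n))
		return MODEL_TRY_CCR;
	/* Fencing returned immediately: no CCR command, no wait for AEN. */
	if (ictrl->cqt == 0)
		return MODEL_RETRY_NOW;
	return MODEL_HOLD_REQUESTS;
}
```

With no peer offering CCR and ictrl->cqt == 0, the model returns
MODEL_RETRY_NOW, which is the no-regression case for systems in the
field today.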

>
> So we can (and should) use CCR as the determining factor whether we
> want to switch to TP8028-compliant behaviour or stick with the original
> implementation.

We do check CCR support and availability in nvme_find_ctrl_ccr(). Adding
a second counter would only spare us the loop in nvme_find_ctrl_ccr(),
which is not worth it IMO.
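
A rough sketch of the trade-off, again as a userspace model and not the
patch code. The sketch_* names, the nr_ccr_capable counter and the
visited field are all hypothetical. The counter stands in for the
proposed second counter, and visited only counts loop iterations so the
difference is visible.

```c
#include <stddef.h>

struct sketch_ctrl {
	unsigned int ccr_limit;	/* CCR commands available right now */
};

struct sketch_subsys {
	struct sketch_ctrl *ctrls;
	size_t nr_ctrls;
	unsigned int nr_ccr_capable;	/* hypothetical second counter */
	size_t visited;			/* instrumentation: loop iterations */
};

/* Lookup by walking the controllers, modelled on the loop in question. */
static struct sketch_ctrl *sketch_find_loop(struct sketch_subsys *s)
{
	size_t i;

	for (i = 0; i < s->nr_ctrls; i++) {
		s->visited++;
		if (s->ctrls[i].ccr_limit > 0)
			return &s->ctrls[i];
	}
	return NULL;
}

/* Same lookup, but the hypothetical counter lets us skip the walk. */
static struct sketch_ctrl *sketch_find_counted(struct sketch_subsys *s)
{
	if (s->nr_ccr_capable == 0)
		return NULL;
	return sketch_find_loop(s);
}
```

Both variants return the same answer. The counter only avoids the walk
when no controller supports CCR, and it has to be kept in sync with the
controller list.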

>
> Cheers,
>
> Hannes
> --
> Dr. Hannes Reinecke Kernel Storage Architect
> hare@xxxxxxx +49 911 74053 688
> SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
> HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich