Re: [PATCH v3 19/21] nvme-tcp: Extend FENCING state per TP4129 on CCR failure

From: Hannes Reinecke

Date: Wed Feb 18 2026 - 03:26:48 EST


On 2/17/26 18:58, Mohamed Khalfella wrote:
On Mon 2026-02-16 13:56:10 +0100, Hannes Reinecke wrote:
On 2/14/26 05:25, Mohamed Khalfella wrote:
If CCR operations fail and CQT is supported, we must defer the retry of
inflight requests per TP4129. Update ctrl->fencing_work to schedule
ctrl->fenced_work, effectively extending the FENCING state. This delay
ensures that inflight requests are held until it is safe for them to be
retried.

Signed-off-by: Mohamed Khalfella <mkhalfella@xxxxxxxxxxxxxxx>
---
drivers/nvme/host/tcp.c | 39 +++++++++++++++++++++++++++++++++++----
1 file changed, 35 insertions(+), 4 deletions(-)
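
For readers without the diff at hand, the decision described in the commit message can be sketched roughly as below. This is a user-space model only, not the patch itself: every name (model_ctrl, model_fencing_done, model_fenced) is hypothetical, and it assumes the hold time is simply the CQT value, which the actual patch may compute differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two-stage fencing flow; not kernel code. */
enum model_state { MODEL_FENCING, MODEL_RECOVERING };

struct model_ctrl {
	enum model_state state;
	unsigned int cqt_ms;	/* assumed: 0 means CQT is not supported */
	unsigned int hold_ms;	/* how long inflight requests stay held */
};

/*
 * First stage, standing in for fencing_work: runs once the CCR attempt
 * has finished. Returns the extra delay (ms) before the second stage.
 */
static unsigned int model_fencing_done(struct model_ctrl *c, bool ccr_ok)
{
	if (!ccr_ok && c->cqt_ms) {
		/* CCR failed and CQT is supported: stay in FENCING longer. */
		c->hold_ms = c->cqt_ms;
		return c->hold_ms;
	}

	/* CCR succeeded, or no CQT to wait for: go to recovery right away. */
	c->hold_ms = 0;
	c->state = MODEL_RECOVERING;
	return 0;
}

/* Second stage, standing in for fenced_work: runs after the delay. */
static void model_fenced(struct model_ctrl *c)
{
	/* The delayed stage only makes sense while still fencing. */
	assert(c->state == MODEL_FENCING);
	c->hold_ms = 0;
	c->state = MODEL_RECOVERING;
}
```

In this model the controller leaves FENCING only once, either directly from the first stage or from the delayed second stage, so inflight requests are never released while the hold is still pending.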

Can't you merge / integrate this into the nvme_fence_ctrl() routine?

ctrl->fencing_work and ctrl->fenced_work live in the transport-specific
controller, struct nvme_tcp_ctrl in this case. There is no easy way to
access these members from nvme_fence_ctrl(). One way around that is to
move them into struct nvme_ctrl. But we call error recovery after a
controller is fenced, and error recovery is implemented in a
transport-specific way. That is why the delay is implemented/repeated
for every transport.

The previous patch already extended the timeout to cover for CQT, so
we can just wait for the timeout if CCR failed, no?

Following on from the point above: one change that could be made is to
reset the controller after fencing finishes instead of using error
recovery. This way everything lives in core.c. But I have not tested that.

Do you think this is better than what has been implemented now?

Yeah, the eternal problem.
At one point someone will have to explain to me why 'reset' and
'error handling' are two _distinct_ code paths in nvme-tcp.
I really don't get that. I _guess_ it's trying to hold requests
when doing a reset, and aborting requests if it's an error.
But why one needs to make that distinction is a mystery to
me; FC combines both paths and seems to work quite happily.

Thing is, that will get in the way when trying to move fencing
into the generic layer; you can only call 'nvme_reset_ctrl()',
and hope that this one will abort commands.

I'll check.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich