RE: [PATCH v3 00/21] TP8028 Rapid Path Failure Recovery

From: Achkinazi, Igor

Date: Wed Apr 01 2026 - 09:35:41 EST


Hi Mohamed,

We tested v3 of this series against a Dell PowerFlex NVMe over TCP target.
Below is a summary of our test methodology and results.

Test Environment
----------------
- Target: Dell PowerFlex (NVMe over TCP) running engineering code
- Host: Standard Linux host with the patch applied (unpatched for the baseline)
- IO: vdbench with data integrity (DI) validation

Test Scenarios and Results
--------------------------

1) Target supports CQT + CCR -- without the patch (baseline)

vdbench DI validation FAILED. Data integrity errors observed.

2) Target supports CQT + CCR -- with the patch applied

Keep Alive timeout triggers controller fencing. CCR is issued
via the surviving controller and succeeds. Controller reconnects
and vdbench DI validation PASSES with no data integrity errors.

Kernel log:
nvme nvme4: I/O tag 1 (b001) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
nvme nvme4: starting controller fencing
nvme nvme4: attempting CCR, timeout 15000ms
nvme nvme4: CCR succeeded using nvme3
nvme nvme4: failed nvme_keep_alive_end_io error=10
nvme nvme4: Reconnecting in 10 seconds...
nvme nvme4: creating 1 I/O queues.
nvme nvme4: mapped 1/0/0 default/read/poll queues.
nvme nvme4: Successfully reconnected (attempt 1/60)

3) Target supports CQT only (no CCR) -- with the patch applied

CCR fails as expected and the patch falls back to TP4129 time-based
recovery. Controller reconnects after the recovery timer expires
and vdbench DI validation PASSES with no data integrity errors.

Kernel log:
nvme nvme4: I/O tag 0 (9000) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
nvme nvme4: starting controller fencing
nvme nvme4: attempting CCR, timeout 15000ms
nvme nvme4: CCR failed, switch to time-based recovery, timeout = 15000ms
nvme nvme4: failed nvme_keep_alive_end_io error=5
nvme nvme4: Time-based recovery finished
nvme nvme4: Reconnecting in 10 seconds...
nvme nvme4: creating 1 I/O queues.
nvme nvme4: mapped 1/0/0 default/read/poll queues.
nvme nvme4: Successfully reconnected (attempt 1/60)
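
For anyone scripting a similar check, the log signatures in scenarios 2
and 3 can be told apart mechanically. The following is only a minimal
Python sketch. It assumes the message strings stay exactly as quoted in
the logs above, and the name classify_recovery and the returned labels
are illustrative, not part of the patch.

```python
import re

# Patterns copied from the kernel log lines quoted in this report.
# Assumption: the wording is stable only for this version of the patch.
FENCING = re.compile(r"nvme (nvme\d+): starting controller fencing")
CCR_OK = re.compile(r"nvme (nvme\d+): CCR succeeded using (nvme\d+)")
CCR_FALLBACK = re.compile(
    r"nvme (nvme\d+): CCR failed, switch to time-based recovery")
RECONNECTED = re.compile(r"nvme (nvme\d+): Successfully reconnected")


def classify_recovery(log_text):
    """Map each fenced controller to its recovery path and reconnect state."""
    result = {}
    for line in log_text.splitlines():
        m = FENCING.search(line)
        if m:
            result[m.group(1)] = {"path": "unknown", "reconnected": False}
            continue
        m = CCR_OK.search(line)
        if m and m.group(1) in result:
            result[m.group(1)]["path"] = "ccr"
            continue
        m = CCR_FALLBACK.search(line)
        if m and m.group(1) in result:
            result[m.group(1)]["path"] = "time-based"
            continue
        m = RECONNECTED.search(line)
        if m and m.group(1) in result:
            result[m.group(1)]["reconnected"] = True
    return result


# Excerpt of the scenario 3 log from this report.
sample = """\
nvme nvme4: starting controller fencing
nvme nvme4: attempting CCR, timeout 15000ms
nvme nvme4: CCR failed, switch to time-based recovery, timeout = 15000ms
nvme nvme4: Successfully reconnected (attempt 1/60)
"""
print(classify_recovery(sample))
# {'nvme4': {'path': 'time-based', 'reconnected': True}}
```

The patterns use search() rather than match(), so a dmesg timestamp
prefix in front of each line does not break the classification.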

4) Target supports neither CQT nor CCR -- with the patch applied

vdbench DI validation FAILED. Data integrity errors observed.
This is expected because without CQT the host has no safe hold
period and inflight IO may be retried prematurely.

Additional Targeted Tests
-------------------------

All of the following passed on a PowerFlex target with CQT + CCR.

Simple CCR:
- 2 controllers, one times out. CCR issued via surviving controller.
CCR log page entry created and read by the host successfully.

Kernel log:
nvme nvme3: I/O tag 0 (2000) type 4 opcode 0x18 (Keep Alive) QID 0 timeout
nvme nvme3: starting controller fencing
nvme nvme3: attempting CCR, timeout 15000ms
nvme nvme3: CCR succeeded using nvme4

Multi-Controller / Scale:
- 3+ controllers with multiple simultaneous CCRs. Controller A
resets B and C concurrently. Both entries are tracked correctly
and their completions trigger a coalesced AEN.
- Cross-CCR: Controller A resets B while B resets A simultaneously.
Both operations proceed correctly.
- 4 controllers, CCR Limit set to 2, 3 controllers timed out.
2 CCRs issued, 3rd controller falls back to TP4129 time-based
recovery as expected.
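
The CCR Limit case above can be sketched as a simple split of the
timed-out controllers. This is a toy model of the expected outcome, not
the patch logic: it assumes a single surviving controller, that
controllers are handled in the order given, and that every in-flight CCR
occupies one slot of the limit. The function and controller names are
hypothetical.

```python
def plan_recovery(timed_out, ccr_limit):
    """Split timed-out controllers into CCR and time-based recovery groups.

    Assumption: the first `ccr_limit` controllers get a CCR via the one
    surviving controller; the rest fall back to time-based recovery.
    """
    if ccr_limit < 0:
        raise ValueError("ccr_limit must be non-negative")
    return {
        "ccr": list(timed_out[:ccr_limit]),
        "time_based": list(timed_out[ccr_limit:]),
    }


# 4 controllers, 3 timed out, CCR Limit = 2 (the case tested above).
print(plan_recovery(["nvme1", "nvme2", "nvme3"], ccr_limit=2))
# {'ccr': ['nvme1', 'nvme2'], 'time_based': ['nvme3']}
```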

AEN (Async Event Notification):
- AEN delivered on CCR completion with NVME_ASYNC_EVENT_CCR_CHANGED.
- AEN re-arm verified: after reading CCR log page (clearing
AEN_CCR_PENDING), another CCR triggers a new AEN.
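
The coalescing and re-arm behaviour checked above can be illustrated
with a small state model. This is a sketch under assumptions, not code
from the patch or the target: the class and method names are
hypothetical, the `pending` flag stands in for AEN_CCR_PENDING, and the
model does not claim anything about whether log entries are discarded
on read.

```python
class CcrAenModel:
    """Toy model of coalesced CCR AENs with re-arm on log page read."""

    def __init__(self):
        self.pending = False    # stands in for AEN_CCR_PENDING
        self.aens_sent = 0      # count of NVME_ASYNC_EVENT_CCR_CHANGED AENs
        self.log_entries = []   # completed CCRs recorded in the log page

    def ccr_completed(self, entry):
        self.log_entries.append(entry)
        if not self.pending:
            # First completion since the last log page read: send one AEN.
            self.pending = True
            self.aens_sent += 1
        # Later completions are coalesced into the AEN already sent.

    def read_log_page(self):
        # Reading the CCR log page clears the pending flag (re-arm).
        self.pending = False
        return list(self.log_entries)


model = CcrAenModel()
model.ccr_completed("B")
model.ccr_completed("C")    # coalesced, no second AEN
print(model.aens_sent)      # 1
model.read_log_page()       # re-arms the AEN
model.ccr_completed("D")
print(model.aens_sent)      # 2
```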

Identify Controller:
- CIU is non-zero, CIRN is populated, CCRL = 4.
- CIU/CIRN values change after disconnect/reconnect (new instance).

CCR Log Page:
- After successful CCR, log page fields verified: ICID, CIU, ACID,
CCRS all populated correctly.

Summary
-------
Patch v3 behaves as expected in all tested scenarios. CCR recovery
functions as designed when the target supports it, and the TP4129
fallback path operates correctly when CCR is unavailable. Data
integrity is preserved in every configuration where the target
supports CQT.
Note that PowerFlex was running engineering code, not production code.

Tested-by: Igor Achkinazi <igor.achkinazi@xxxxxxxx>

