Re: [PATCH] nvme-tcp: wait socket wmem to drain in queue stop

From: Sagi Grimberg
Date: Sun Apr 13 2025 - 18:25:16 EST

On 05/04/2025 8:48, Michael Liang wrote:
> This patch addresses a data corruption issue observed in nvme-tcp during
> testing.
>
> Issue description:
> In an NVMe native multipath setup, when an I/O timeout occurs, all inflight
> I/Os are canceled almost immediately after the kernel socket is shut down.
> These canceled I/Os are reported as host path errors, triggering a failover
> that succeeds on a different path.
>
> However, at this point, the original I/O may still be outstanding in the
> host's network transmission path (e.g., the NIC's TX queue). From the
> user-space application's perspective, the buffers associated with that I/O
> are considered completed, since the I/O was acknowledged on the other path,
> and they may be reused for new I/O requests.
>
> Because nvme-tcp enables zero-copy by default in the transmission path,
> this can lead to the now-reused buffer contents being sent to the original
> target, ultimately causing data corruption.
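
Just so we are describing the same mechanism: "zero-copy" here means the
driver hands the request pages to the TCP stack by reference instead of
copying them into the skb, so a page stays in use by the stack until it has
actually been transmitted. A minimal sketch of that kind of send on a kernel
socket (helper name made up, details simplified; recent mainline uses
MSG_SPLICE_PAGES for this) looks roughly like:

/* Illustrative only - not the exact nvme-tcp data path. */
#include <linux/bvec.h>
#include <linux/net.h>
#include <linux/socket.h>
#include <linux/uio.h>

static int example_send_page(struct socket *sock, struct page *page,
                             unsigned int offset, unsigned int len)
{
        struct bio_vec bvec;
        struct msghdr msg = {
                /*
                 * MSG_SPLICE_PAGES asks the stack to reference the page
                 * rather than copy it, so the page must not be reused
                 * until the stack is done with it.
                 */
                .msg_flags = MSG_DONTWAIT | MSG_SPLICE_PAGES,
        };

        bvec_set_page(&bvec, page, len, offset);
        iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, len);
        return sock_sendmsg(sock, &msg);
}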

This is unexpected.

1. Before retrying the command, the host shuts down the socket.
2. The host sets sk_lingertime to 0, which means that as soon as the socket
is shut down, nothing should be transmitted on that socket again, zero-copy
or not.

Perhaps there is something not handled correctly with linger=0? Perhaps you
should try with linger=<some-timeout>?