Re: [PATCH] nvme-tcp: Check if request has started before processing it

From: Hannes Reinecke
Date: Fri Feb 26 2021 - 11:43:48 EST


On 2/26/21 5:13 PM, Keith Busch wrote:
> On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
>> On 2/26/21 1:35 PM, Daniel Wagner wrote:
>>> On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
>>>> Well, I think we should probably figure out why that is happening first.
>>>
>>> I got my hands on a tcpdump trace. I've trimmed it to this:
>>>
>>> [ .. ]
>>> NVM Express Fabrics TCP
>>> Pdu Type: CapsuleResponse (5)
>>> Pdu Specific Flags: 0x00
>>> .... ...0 = PDU Header Digest: Not set
>>> .... ..0. = PDU Data Digest: Not set
>>> .... .0.. = PDU Data Last: Not set
>>> .... 0... = PDU Data Success: Not set
>>> Pdu Header Length: 24
>>> Pdu Data Offset: 0
>>> Packet Length: 24
>>> Unknown Data: 02000400000000001b0000001f000000
>>>
>>> 0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
>>> 0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4 .L..@.@.....&...
>>> 0020 c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18 ...D.O.X....2...
>>> 0030 01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3 ................
>>> 0040 5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ]...............
>>> 0050 00 00 1b 00 00 00 1f 00 00 00 ..........
>>>
>> As I suspected, we did receive an invalid frame.
>> Data digest would have saved us, but then it's not enabled.
>>
>> So we do need to check if the request is valid before processing it.
>
> That's just addressing a symptom. You can't fully verify the request is
> valid this way because the host could have started the same command ID
> the very moment before the code checks it, incorrectly completing an
> in-flight command and getting data corruption.

Oh, I am fully aware.

Bad frames are just that, bad frames.
We can only fully validate that when digests are enabled, but I gather
that controllers sending out bad frames wouldn't want to enable digests,
either. So relying on that is possibly not an option.
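
To make the validation point concrete, here is a minimal userspace sketch
that decodes the 24 PDU bytes from the quoted trace (offset 0x42 onward in
the hex dump). It assumes the NVMe/TCP common header layout (type, flags,
hlen, pdo, 32-bit little-endian plen) followed by a standard 16-byte
completion queue entry. The struct and function names are made up for
illustration and are not the driver's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read little-endian integers from a byte buffer. */
static uint16_t get_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t get_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical flat view of a CapsuleResponse PDU (header + CQE). */
struct capsule_resp {
	uint8_t  type, flags, hlen, pdo;	/* common header */
	uint32_t plen;
	uint32_t dw0, dw1;			/* CQE command-specific result */
	uint16_t sq_head, sq_id;		/* CQE dword 2 */
	uint16_t command_id, status;		/* CQE dword 3 */
};

/* Returns 0 on success, -1 if the buffer is shorter than 24 bytes. */
static int decode_capsule_resp(const uint8_t *buf, size_t len,
			       struct capsule_resp *out)
{
	assert(buf != NULL && out != NULL);
	if (len < 24)
		return -1;
	out->type = buf[0];
	out->flags = buf[1];
	out->hlen = buf[2];
	out->pdo = buf[3];
	out->plen = get_le32(buf + 4);
	out->dw0 = get_le32(buf + 8);
	out->dw1 = get_le32(buf + 12);
	out->sq_head = get_le16(buf + 16);
	out->sq_id = get_le16(buf + 18);
	out->command_id = get_le16(buf + 20);
	out->status = get_le16(buf + 22);
	return 0;
}

/* The PDU bytes copied from the trace, starting at offset 0x42. */
static const uint8_t trace_pdu[24] = {
	0x05, 0x00, 0x18, 0x00, 0x18, 0x00, 0x00, 0x00,
	0x02, 0x00, 0x04, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x1b, 0x00, 0x00, 0x00, 0x1f, 0x00, 0x00, 0x00,
};
```

Under those assumptions the trace bytes decode to type 5, hlen 24, plen 24,
command_id 0x1f, sq_head 0x1b and status 0. As far as this decode goes the
header fields are self-consistent, so the framing alone does not flag the
PDU.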

So really what I'm trying to avoid is the host crashing on a bad frame.
That kind of thing always resonates badly with customers.
And tripping over an uninitialized command is just too stupid IMO.
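
The kind of guard being discussed can be sketched in userspace C as below.
This is a simplified model, not the patch itself: the request states, the
fixed-size table, and all names are hypothetical. In the driver the
equivalent test would presumably be made on the request looked up from the
command ID, for example with blk_mq_request_started().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_DEPTH 32			/* hypothetical queue depth */

/* Hypothetical request lifecycle, loosely modelled on blk-mq states. */
enum req_state { REQ_IDLE, REQ_IN_FLIGHT, REQ_COMPLETE };

struct request {
	enum req_state state;
	uint16_t status;
};

struct queue {
	struct request reqs[QUEUE_DEPTH];
};

/* Map a command ID from the wire to a request, or NULL if out of range. */
static struct request *tag_to_request(struct queue *q, uint16_t command_id)
{
	if (command_id >= QUEUE_DEPTH)
		return NULL;
	return &q->reqs[command_id];
}

/*
 * Complete the request named by a received completion.
 * Returns 0 on success, -1 if the command ID does not refer to a
 * request that has been started; the caller would treat that as a
 * transport error instead of touching an uninitialized request.
 */
static int process_completion(struct queue *q, uint16_t command_id,
			      uint16_t status)
{
	struct request *rq;

	assert(q != NULL);
	rq = tag_to_request(q, command_id);
	if (!rq || rq->state != REQ_IN_FLIGHT)
		return -1;
	rq->status = status;
	rq->state = REQ_COMPLETE;
	return 0;
}
```

In this model a completion for an idle or already completed slot is
rejected, which is enough to avoid the crash. It does not address the race
raised earlier in the thread: if the same command ID has just been reused
for a new in-flight command, the check passes and the wrong command is
completed.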

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer