Re: [PATCH] nvme-tcp: Check if request has started before processing it

From: Hannes Reinecke
Date: Mon Mar 01 2021 - 14:37:30 EST


On 3/1/21 5:05 PM, Keith Busch wrote:
On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
On 3/1/21 2:26 PM, Daniel Wagner wrote:
On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?

If we receive a response for which we don't have a started request, we
know that something is wrong. Couldn't we just reset the connection
in this case? We don't have to pretend nothing has happened and
continue normally. This would avoid a host crash and would not create
(more) data corruption. Or am I just too naive?

This is actually a sensible solution.
Please send a patch for that.

Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.

We need to distinguish two cases here.
The first is us receiving a frame with an invalid tag, leading to a crash. This can easily be resolved by issuing a reset, as the command was clearly garbage and we need to invoke error handling (which is a reset).

The other case is us receiving a frame with a _duplicate_ tag, i.e. a tag which is _currently_ valid. This is a case which will fail _even now_, as we simply have no way of detecting it.

So what, again, do we lose by fixing the first case?
Apart from a system which does _not_ crash?
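
To make the two cases concrete, here is a minimal userspace sketch, not the actual nvme-tcp code. All names (toy_queue, toy_request, toy_submit, toy_handle_response, QUEUE_DEPTH) are hypothetical and only illustrate the idea: a response whose tag is out of range or refers to a request that was never started asks for a reset instead of being processed, while a response carrying a tag that happens to be in flight is accepted, because nothing distinguishes it from the genuine one.

```c
#include <stdbool.h>
#include <stdint.h>

#define QUEUE_DEPTH 4	/* hypothetical number of tags per queue */

/* What the receive path should do with an incoming response. */
enum rsp_action {
	RSP_COMPLETE,	/* tag matches a started request: complete it */
	RSP_RESET,	/* tag is bogus: invoke error handling (reset) */
};

struct toy_request {
	bool started;	/* set on submission, cleared on completion */
};

struct toy_queue {
	struct toy_request rqs[QUEUE_DEPTH];
};

/* Mark a request as in flight. The caller must pass a valid tag. */
static void toy_submit(struct toy_queue *q, uint16_t tag)
{
	q->rqs[tag].started = true;
}

/*
 * Case 1: the tag is out of range or the request was never started.
 * We know something is wrong, so ask for a reset instead of touching
 * the request.
 *
 * Case 2: the tag is currently valid. A stray or duplicate response
 * with such a tag passes this check; it cannot be detected here.
 */
static enum rsp_action toy_handle_response(struct toy_queue *q, uint16_t tag)
{
	if (tag >= QUEUE_DEPTH || !q->rqs[tag].started)
		return RSP_RESET;

	q->rqs[tag].started = false;
	return RSP_COMPLETE;
}
```

With this check, a second response for an already completed tag also ends up in the reset path, since the request is no longer marked as started.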

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer