Problem in SCSI error handling 2.4/2.6

From: Mark Mokryn
Date: Thu May 13 2004 - 03:23:47 EST


According to the SCSI spec, a LUN may abort all outstanding commands in case of an error (bit QERR set in the control mode page).
This may occur on many SCSI/FC drives or storage systems, and will certainly the case for SATA (libata) when dealing with NCQ or TCQ drives.

The problem is that the Linux SCSI error handler (2.4 & 2.6) identically handles commands failed due to MEDIUM_ERROR and ABORTED_COMMAND by marking both types as NEEDS_RETRY.

What we have seen in such a case is that the error handler will simply requeue these commands, and in most cases, the exact scenario (several commands requeued and then aborted due to a single medium error) will be repeated ad nauseum until the retry limit. The result is often that all of the aborted commands are needlessly failed.

The correct fix is to never retry commands failed due to medium error. Rest assured that when a drive returns this status, exhaustive retries and error correction algorithms have been applied at the drive level. No storage system has the incentive of returning medium error if the error is recoverable.

If the error handler insists on retrying such commands, then at least set a lower retry limit on medium errors (though I believe this is pointless, and may just cause more aborted commands).

In any case - setting the same retry limit on medium errors and aborted commands is a bug.

-Mark



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/