On 2014/2/7 13:46, James Bottomley wrote:
> On Fri, 2014-02-07 at 09:22 +0900, Eiichi Tsukata wrote:
>> Currently, scsi error handling in scsi_io_completion() tries to
>> unconditionally requeue a scsi command while the device stays in an
>> error state. For example, UNIT_ATTENTION causes infinite retry with
>> action == ACTION_RETRY.
>> This is because retryable errors are thought to be temporary and the scsi
>> device will soon recover from those errors. Normally, such a retry policy
>> is appropriate because the device will soon recover from a temporary
>> error state.
>> But there is no guarantee that the device is able to recover from the
>> error state immediately. Actually, we've experienced an infinite retry
>> on some hardware.
>> Therefore a hardware error can result in an infinite command retry loop.
>
> Could you please add an analysis of the actual failure; which devices
> and what conditions.

same question, can you explain?

>> This patch adds a 'retry_timeout' sysfs attribute which limits the retry
>> time of each scsi command. This attribute is located in the scsi sysfs
>> directory, for example "/sys/bus/scsi/devices/X:X:X:X/", and the value is
>> in seconds.
>> Once the scsi command retry time is longer than this timeout,
>> the command is treated as a failure. 'retry_timeout' is set to '0' by
>> default, which means no timeout is set.
>
> Don't do this ... you're mixing a feature (which you'd need to justify)
> with an apparent bug fix.
>
> Once you dump all the complexity, I think the patch boils down to a
> simple check before the action switch in scsi_io_completion():
>
>         if (action != ACTION_FAIL &&
>             time_before(cmd->jiffies_at_alloc + wait_for, jiffies)) {
>                 action = ACTION_FAIL;
>                 description = "command timed out";
>         }
>
> James
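
To make the suggested check easier to reason about, here is a minimal
user-space sketch of the same deadline test. It is only a model, not kernel
code: ticks_t, ticks_before(), struct cmd_model and limit_retry() are
hypothetical names made up for illustration, the current time is passed in
as a parameter instead of being read from jiffies, and wait_for is left as a
caller-supplied value because the mail does not say how it would be computed.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a wrapping tick counter such as jiffies. */
typedef uint32_t ticks_t;

/* Only the two actions named in the thread are modelled here. */
enum action { ACTION_FAIL, ACTION_RETRY };

/* Hypothetical model of a command: only the allocation timestamp. */
struct cmd_model {
    ticks_t jiffies_at_alloc;
};

/*
 * Wraparound-safe "a is before b": the unsigned difference is
 * reinterpreted as signed, the same idea as the kernel's time_before().
 */
static bool ticks_before(ticks_t a, ticks_t b)
{
    return (int32_t)(a - b) < 0;
}

/*
 * Downgrade any non-failing action to ACTION_FAIL once the deadline
 * (allocation time + wait_for) lies strictly before "now".
 */
static enum action limit_retry(enum action action,
                               const struct cmd_model *cmd,
                               ticks_t wait_for, ticks_t now)
{
    if (action != ACTION_FAIL &&
        ticks_before(cmd->jiffies_at_alloc + wait_for, now))
        action = ACTION_FAIL;
    return action;
}
```

With a command allocated at tick 1000 and wait_for of 500, the model keeps
ACTION_RETRY up to and including tick 1500 and turns it into ACTION_FAIL
from tick 1501 on. The signed-difference comparison keeps working when the
counter wraps, which is why a plain "deadline < now" is not used.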