Re: [RFC PATCH] scsi: Add failfast mode to avoid infinite retry loop

From: Eiichi Tsukata
Date: Mon Aug 26 2013 - 05:33:09 EST


(2013/08/23 22:19), James Bottomley wrote:
On Fri, 2013-08-23 at 18:10 +0900, Eiichi Tsukata wrote:
(2013/08/21 3:09), Ewan Milne wrote:
On Tue, 2013-08-20 at 16:13 +0900, Eiichi Tsukata wrote:
(2013/08/19 23:30), James Bottomley wrote:
On Mon, 2013-08-19 at 18:39 +0900, Eiichi Tsukata wrote:
Hello,

This patch adds scsi device failfast mode to avoid infinite retry loop.

Currently, SCSI error handling in scsi_decide_disposition() and
scsi_io_completion() unconditionally retries on some errors. This is because
retryable errors are thought to be temporary, and the SCSI device is expected
to recover from them soon. Normally, such a retry policy is appropriate.
But there is no guarantee that the device can recover from the error state
immediately. Some hardware errors may prevent the device from recovering.
Therefore a hardware error can result in an infinite command retry loop. In
fact, a CHECK_CONDITION error with sense key UNIT_ATTENTION caused an infinite
retry loop in our environment. As the comments in the kernel source code say,
UNIT_ATTENTION means the device must have had a power glitch and is expected
to recover from that state immediately. But it seems that a hardware error
caused a permanent UNIT_ATTENTION error.

To solve the above problem, this patch introduces a SCSI device "failfast mode".
If failfast mode is enabled, the retry counts of all SCSI commands are limited
to scmd->allowed (== SD_MAX_RETRIES == 5). No command is allowed to retry
indefinitely; a command fails immediately once its retry count exceeds the
upper limit.
Failfast mode is useful on mission-critical systems which are required
to keep running, because they need to fail over to the secondary
system once they detect a failure.
By default, failfast mode is disabled, because a failfast policy is not suitable
for most use cases, which can tolerate I/O latency caused by device hardware
errors.
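
As an illustration only (this is not the patch code), the retry cap described
above can be sketched in userspace C. The names fake_cmd and should_retry are
hypothetical; only the `allowed` limit of 5 comes from the description. The
sketch assumes that a command may be retried up to `allowed` times and is
failed on the next attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a SCSI command; only the fields needed here. */
struct fake_cmd {
    unsigned int retries;   /* retries performed so far */
    unsigned int allowed;   /* upper limit, e.g. 5 as described above */
};

/*
 * Decide whether a command that hit a "retryable" error may be retried.
 * With failfast disabled the answer is always yes (the unbounded behaviour
 * described above); with failfast enabled the retry count is capped.
 */
static bool should_retry(struct fake_cmd *cmd, bool failfast)
{
    assert(cmd != NULL);
    if (failfast && cmd->retries >= cmd->allowed)
        return false;       /* fail the command instead of looping */
    cmd->retries++;
    return true;
}
```

With failfast enabled and allowed == 5, should_retry() returns true five times
and false afterwards; with failfast disabled it never returns false.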

To enable failfast mode (disabled by default):
# echo 1 > /sys/bus/scsi/devices/X:X:X:X/failfast
To disable:
# echo 0 > /sys/bus/scsi/devices/X:X:X:X/failfast
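
For a management tool that cannot shell out, the same toggle could be written
from C. This is a minimal sketch: write_flag is a hypothetical helper, and the
failfast attribute exists only with this patch applied, so the path is passed
in by the caller rather than assumed.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Write "1\n" or "0\n" to a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int write_flag(const char *path, int enable)
{
    assert(path != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(enable ? "1\n" : "0\n", f) >= 0;
    if (fclose(f) != 0)     /* write errors may only surface on close */
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would pass the device's failfast attribute path and check the return
value, since the write fails if the attribute is absent.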

Furthermore, I'm planning to make the upper limit configurable.
Currently, I have two plans for implementing it:
(1) set the same upper limit for all errors.
(2) set a separate upper limit for each error.
The first implementation is simple and easy to implement but not flexible.
Some users may want to set a different upper limit for each error, depending
on the SCSI device they use. The second implementation satisfies that
requirement but can be too fine-grained and tedious to configure, because
there are so many SCSI error codes. The default of 5 retries may be too many
for some errors but too few for others.

Which would be the appropriate implementation?
Any comments or suggestions are welcome as usual.

I'm afraid you'll need to propose another solution. We have a large
selection of commands which, by design, retry until the command exceeds
its timeout. UA is one of those (as are most of the others you're
limiting). How do you kick this device out of its UA return (because
that's the recovery that needs to happen)?

James



Thanks for reviewing, James.

Originally, I planned that once the retry count exceeds its limit,
a monitoring tool would stop the server, using the SCSI printk error message
as a trigger.
In the current failfast mode implementation, the command fails when
its retry count exceeds the limit. However, I noticed that just printing an
error message when the retry limit is exceeded, without changing the retry
logic, would be enough to stop the server and fail over. There is, however,
no guarantee that a userspace application can work properly when the disk has
failed.
So I'm now considering that just calling panic() when the retry limit is
exceeded is better.

For that reason, I propose adding a "panic_on_error" option as a
sysfs parameter; if panic_on_error mode is enabled, the server panics
immediately once it detects that the retry limit is exceeded. Of course, it is
disabled by default.

I would appreciate it if you could give me some comments.

Eiichi
--

For what it's worth, I've seen a report of a case where a storage array
returned a CHECK CONDITION with invalid sense data, which caused the
command to be retried indefinitely.

Thank you for commenting, Ewan.
I appreciate your information about indefinite retry on CHECK CONDITION.

I'm not sure what you can do about
this, if the device won't ever complete a command without an error.
Perhaps it should be offlined after sufficiently bad behavior.

I don't think you want to panic on an error, though. In a clustered
environment it is possible that the other systems will all fail in the
same way, for example.

-Ewan


Yes, basically the device should be offlined on error detection.
Just offlining the disk is enough when the error occurs on a disk that is not
the system disk (the one the OS is installed on). Panic goes too far in that
case.

However, in a clustered environment where each computer uses its own disk
and the disks are not shared, calling panic() is suitable when an error
occurs on the system disk.

However, when not in a clustered environment, it won't be. Decisions
about whether to panic the system or not are user space policy, and
should not be embedded into subsystems. What we need to do is to come
up with a way of detecting the condition, reporting it and possibly
taking some action.

Because even with such a disk error, the cluster monitoring
tool may not be able to detect the system failure while the heartbeat
continues working.
So I think offlining is basically enough, but panic is also necessary
in some cases.

Offline seems a bit drastic ... what happens if you send it a target
reset?

James


I see. Users should decide whether or not to panic.
As Ric says, that should be done in the file system or a higher layer.

I'm now considering handling SCSI errors in user space, with a printk
error message as the failover trigger. Is there currently a good
way to detect an indefinite retry at the SCSI layer?
/proc/sys/dev/scsi/logging_level can show detailed information about SCSI
commands, but it produces too much output to detect an indefinite retry.
How about adding a printk error message when the retry count exceeds
scmd->allowed for each SCSI command?
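
To make the question concrete, here is a rough userspace sketch of the idea,
not kernel code. The names retry_state and note_retry and the message text are
hypothetical, and fprintf() stands in for printk(). It also assumes the
message should be emitted only once per command, so that a command stuck in a
retry loop does not flood the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical per-command retry bookkeeping. */
struct retry_state {
    unsigned int retries;   /* retries performed so far */
    unsigned int allowed;   /* limit, in the role of scmd->allowed */
    bool warned;            /* set once the message has been emitted */
};

/*
 * Record one more retry without changing the retry decision itself.
 * Returns true exactly once: when the count first exceeds `allowed`.
 */
static bool note_retry(struct retry_state *st, const char *dev)
{
    assert(st != NULL && dev != NULL);

    st->retries++;
    if (st->retries > st->allowed && !st->warned) {
        st->warned = true;
        fprintf(stderr, "%s: retry count exceeded limit (%u)\n",
                dev, st->allowed);
        return true;
    }
    return false;
}
```

The retry logic stays as it is; a monitoring tool would only have to watch for
the single message.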

Eiichi