Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR

From: Ric Wheeler
Date: Fri Feb 02 2007 - 11:16:58 EST

Next message: Eric Dumazet: "Re: Try to fill route cache"
Previous message: Bill Davidsen: "Re: [ANN] Userspace M-on-N threading model implementation. Alpharelease."
In reply to: James Bottomley: "Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR"
Next in thread: Douglas Gilbert: "Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

James Bottomley wrote:

On Fri, 2007-02-02 at 14:42 +0000, Alan wrote:

The interesting point of this question is about the typically pattern of IO errors. On a read, it is safe to assume that you will have issues with some bounded numbers of adjacent sectors.

Which in theory you can get by asking the drive for the real sector size
from the ATA7 info. (We ought to dig this out more as its relevant for
partition layout too).

Actually, my point is that damage typically impacts a cluster of disk sectors that are adjacent. Think of a drive that has junk on the platter or a some such thing - the contamination is likely to be localized.

I really like the idea of being able to set this kind of policy on a per drive instance since what you want here will change depending on what your system requirements are, what the system is trying to do (i.e., when trying to recover a failing but not dead yet disk, IO errors should be as quick as possible and we should choose an IO scheduler that does not combine IO's).

That seems to be arguing for a bounded "live" time including retry run
time for a command. That's also more intuitive for real time work and for
end user setup. "Either work or fail within n seconds"

Actually, then I think perhaps we use the allowed retries for this ...

I really am not a big retry fan for most modern drives - the drive will try really, really hard to complete an IO for us and multiple retries can just slow down the higher level application from recovering.

So you would fail a single sector and count it against the retries.
When you've done this allowed retries times, you fail the rest of the
request.

James

I think that we need to play with some of these possible solutions on some real-world bad drives and see how they react.

We should definitely talk more about this at the workshop ;-)

ric

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Eric Dumazet: "Re: Try to fill route cache"
Previous message: Bill Davidsen: "Re: [ANN] Userspace M-on-N threading model implementation. Alpharelease."
In reply to: James Bottomley: "Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR"
Next in thread: Douglas Gilbert: "Re: [PATCH] scsi_lib.c: continue after MEDIUM_ERROR"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]