Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post]block: replace barrier with sequenced flush]

From: Hannes Reinecke
Date: Wed Sep 01 2010 - 03:32:36 EST

Next message: KOSAKI Motohiro: "Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake"
Previous message: Armin Steinhoff: "Re: UIO and Fedora 13 (kernel 33.6)"
Next in thread: Hannes Reinecke: "Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post]block: replace barrier with sequenced flush]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Mike Snitzer wrote:
> Hi Kiyoshi,
>
> On Mon, Aug 30 2010 at 2:13am -0400,
> Kiyoshi Ueda <k-ueda@xxxxxxxxxxxxx> wrote:
>
>>> That does seem like a valid concern. But I'm not seeing why its unique
>>> to SYNCHRONIZE CACHE. Any IO that fails on the target side should be
>>> passed up once the error gets to DM.
>> See the Tejun's explanation again:
>> http://marc.info/?l=linux-kernel&m=128267361813859&w=2
>> What I'm concerning is whether the same thing as Tejun explained
>> for ATA can happen on other types of devices.
>>
>>
>> Normal write command has data and no data loss happens on error.
>> So it can be retried cleanly, and if the result of the retry is
>> success, it's really success, no implicit data loss.
>>
>> Normal read command has a sector to read. If the sector is broken,
>> all retries will fail and the error will be reported upwards.
>> So it can be retried cleanly as well.
>
> I reached out to Fred Knight on this, to get a more insight from a pure
> SCSI SBC perspective, and he shared the following:
>
> ----- Forwarded message from "Knight, Frederick" <Frederick.Knight@xxxxxxxxxx> -----
>
>> Date: Tue, 31 Aug 2010 13:24:15 -0400
>> From: "Knight, Frederick" <Frederick.Knight@xxxxxxxxxx>
>> To: Mike Snitzer <snitzer@xxxxxxxxxx>
>> Subject: RE: safety of retrying SYNCHRONIZE CACHE?
>>
>> There are requirements in SBC to maintain data integrity. If you WRITE
>> a block and READ that block, you must get the data you sent in the
>> WRITE. This will be synchronized around the completion of the WRITE.
>> Before the WRITE completes, who knows what a READ will return. Maybe
>> all the old data, maybe all the new data, maybe some mix of old and new
>> data. Once the WRITE ends successful, all READs of those LBAs (from any
>> port) will always get the same data.
>>
>> As for errors, SBC describes how the deferred errors are reported (like
>> when a CACHE tries to flush but fails). So if a write from cache to
>> media does have problems, the device would tell you via a CHECK
>> CONDITION (with the first byte of the sense data set to 71h or 73h. SBC
>> clause 4.12 and 4.13 cover a lot of this information. It is these error
>> codes that prevent silent loss of data. And, in this case, when the
>> CHECK CONDITION is delivered, it will have nothing to do with the
>> command that was issued (the victim command). If you look into the
>> sense data, you will see the deferred error flag, and all the additional
>> information fields will relate to the original I/O
>>
>> SYNCHRONIZE CACHE is not substantially different than a WRITE (it puts
>> data on the media). So issuing it multiple times wouldn't be any
>> different than issuing multiple WRITES (it might put a temporary dent in
>> performance as everything flushes out to media). If it or any other
>> commands fail with 71h/73h, then you have to dig down into the sense
>> data buffer to find out what happened. For example, if you issue a
>> WRITE command, and it completes into write back cache but later (before
>> being written to the media), some of the cache breaks and looses data,
>> then the device must signal a deferred error to tell the host, and cause
>> a forced error on the LBA in question.
>>
>> Does that help?
>>
>> Fred
> ----- End forwarded message -----
>
> Seems like verifying/improving the handling of CHECK CONDITION is a more
> pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
> retries. Without proper handling we could completely miss these
> deferred errors.
>
Yes.

> But how to effectively report such errors to upper layers is unclear to
> me given that a particular SCSI command can carry error information for
> IO that was already acknowledged successful (e.g. to the FS).
>
> drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
> illustrate Linux's current CHECK CONDITION handling. I need to look
> closer at how deferred errors propagate to upper layers. After an
> initial look it seems scsi_error.c does handle retrying commands where
> appropriate.
>
> I believe Hannes has concerns/insight here.
>

Quite. We _should_ be handling deferred errors correctly;
if you check drivers/scsi/scsi_lib.c:scsi_io_completion()
you'll find this:

if (host_byte(result) == DID_RESET) {
/* Third party bus reset or reset for error recovery
* reasons. Just retry the command and see what
* happens.
*/
action = ACTION_RETRY;
} else if (sense_valid && !sense_deferred) {
...
} else {
description = "Unhandled error code";
action = ACTION_FAIL;
}

ie for deferred errors we're already aborting the command. Not sure
if I agree with this bit in drivers/scsi/scsi_lib.c:

static int scsi_check_sense(struct scsi_cmnd *scmd)
{
struct scsi_device *sdev = scmd->device;
struct scsi_sense_hdr sshdr;

if (! scsi_command_normalize_sense(scmd, &sshdr))
return FAILED; /* no valid sense data */

if (scsi_sense_is_deferred(&sshdr))
return NEEDS_RETRY;

I doubt we can resolve the situation by retrying the command, which
will be the wrong command to retry anyway. I would rather
have those retry 'SUCCESS' and add another case in scsi_io_completion()
to notify us about the deferred error.

I'll be sending a patch.

Cheers,

Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@xxxxxxx +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: KOSAKI Motohiro: "Re: [PATCH 2/3] mm: page allocator: Calculate a better estimate of NR_FREE_PAGES when memory is low and kswapd is awake"
Previous message: Armin Steinhoff: "Re: UIO and Fedora 13 (kernel 33.6)"
Next in thread: Hannes Reinecke: "Re: safety of retrying SYNCHRONIZE CACHE [was: Re: [PATCHSET block#for-2.6.36-post]block: replace barrier with sequenced flush]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]