Re: What does this scsi error mean ?

From: linux-os (Dick Johnson)
Date: Tue Jan 16 2007 - 10:17:02 EST



On Mon, 15 Jan 2007, Olivier Galibert wrote:

> On Mon, Jan 15, 2007 at 06:45:40PM +0000, Alan wrote:
>> On Mon, 15 Jan 2007 18:16:02 +0100
>> Olivier Galibert <galibert@xxxxxxxxx> wrote:
>>
>>> sd 0:0:0:0: SCSI error: return code = 0x08000002
>>> sda: Current: sense key: Hardware Error
>>> ASC=0x42 ASCQ=0x0
>>
>> I'll give you a clue: The words "Hardware Error".
>>
>> Run a SCSI verify pass on the drive with some drive utilities and see
>> what happens. If you are lucky it'll just reallocate blocks and decide
>> the drive is ok; if not, well, see what the SMART data thinks.
>
> Both smart and the internal blade diagnostics say "everything is a-ok
> with the drive, there hasn't been any error ever except a bunch of
> corrected ECC ones, and no more than with a similar drive in another
> working blade". Hence my initial post. "Hardware error" is kinda
> imprecise, so I was wondering whether it was unexpected controller
> answer, detected transmission error, block write error, sector not
> found... Is there a way to have more information?
>
> OG.

Correctable SCSI errors show that the data in a sector was not read
cleanly, but the device was able to fix the data error because of the
redundancy in the ECC. The error could be permanently fixed if you
rewrote the sector. You probably don't know where the bad sector is
without adding a printk() to driver code. Some BIOS SCSI utilities
(Adaptec) have the capability of reading an entire drive and fixing
bad sectors either by rewrite or relocation. Since drives can be
accessed as files, you could write a utility that opens the RAW
device with it NOT mounted, reads a bunch of sectors, then writes
them back. To do this, you need to verify that lseek() works on
your particular drive because you need to write the data back to
the same offset that you read it from. I mention this because
the raw r/w of an early Adaptec (aha1542) driver didn't implement
lseek; it just returned 'okay'. You can imagine the mess I made of
a drive with that controller!

Once you verify that lseek works, the rest of the code is trivial.
I suggest reading then writing 64 kilobytes at a time (128 sectors
of 512 bytes). It will seem to take 'forever', but the retries on
these relatively short groups of sectors will be quick when errors
are encountered.

Make sure the drive is either not mounted or mounted r/o.
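
The read-then-rewrite pass described above could be sketched roughly
as follows. This is only a minimal sketch under stated assumptions, not
a tested recovery tool: the names seek_works() and rewrite_in_place()
are made up for illustration, 512-byte sectors are assumed, and the
seek check can be fooled if the first two 64 KB chunks happen to hold
identical data. It stops at the first read or write error instead of
retrying.

```c
#define _FILE_OFFSET_BITS 64      /* 64-bit off_t, so big drives work on 32-bit hosts */
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK_BYTES (64 * 1024)   /* 128 sectors, assuming 512-byte sectors */

/* Hypothetical helper: returns 0 if lseek() really repositions the
 * descriptor, -1 otherwise. Reads the first chunk, seeks back to 0,
 * reads again, and expects to see the same bytes both times. */
int seek_works(int fd)
{
    static unsigned char first[CHUNK_BYTES], again[CHUNK_BYTES];
    ssize_t a, b;

    if (lseek(fd, 0, SEEK_SET) != 0)
        return -1;
    a = read(fd, first, sizeof first);
    if (a <= 0)
        return -1;
    if (lseek(fd, 0, SEEK_SET) != 0)
        return -1;
    b = read(fd, again, sizeof again);
    /* A driver that ignores lseek() would hand back the *next* chunk here. */
    if (b != a || memcmp(first, again, (size_t)a) != 0)
        return -1;
    return lseek(fd, 0, SEEK_SET) == 0 ? 0 : -1;
}

/* Hypothetical helper: reads each chunk and writes it back to the same
 * offset. Returns the number of bytes rewritten, or -1 on the first error. */
long long rewrite_in_place(int fd)
{
    static unsigned char buf[CHUNK_BYTES];
    off_t pos = 0;

    for (;;) {
        ssize_t got, put = 0;

        if (lseek(fd, pos, SEEK_SET) != pos)
            return -1;
        got = read(fd, buf, sizeof buf);
        if (got < 0)
            return -1;
        if (got == 0)
            break;                          /* end of device */

        /* Go back to where this chunk started before writing it. */
        if (lseek(fd, pos, SEEK_SET) != pos)
            return -1;
        while (put < got) {
            ssize_t n = write(fd, buf + put, (size_t)(got - put));
            if (n <= 0)
                return -1;
            put += n;
        }
        pos += got;
    }
    if (fsync(fd) != 0)                     /* push the rewrites out to the drive */
        return -1;
    return (long long)pos;
}
```

A caller would open the unmounted device with open(path, O_RDWR), run
seek_works() first, and only call rewrite_in_place() if that returned 0.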

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.67 BogoMips).
New book: http://www.AbominableFirebug.com/