Re: SCSI disk I/O error

Doug Ledford (dledford@dialnet.net)
Sun, 19 Apr 1998 03:51:01 -0500


Sean Farley wrote:
>
> At the moment I cannot think of a great subject that will catch everyone's
> attention. :)
>
> This morning when I sat down to my computer to see if some work I left for
> it to think (compile) on had gone smoothly, I noticed that although
> applications, under X, were still active, I was unable to start any new
> applications.
>
> I was able to flip over to a console (Ctrl-Alt-F1). Ctrl-ScrlLock gave me
> pages of processes running. Except for a few processes that I knew were
> running, most of the processes were xinted. I assume that the computer
> had used up all of the process slots which is why I was unable to log on at
> the console or start-up another xterm.
>
> Come to think about it, before I went to bed, I was unable to start qps
> (Qt process utility) to view the running processes. I thought about looking
> at it in the morning when I woke up. I don't know if the SCSI errors
> caused the problem or something else did. The first SCSI error was well
> after I had gone to bed and had had trouble with qps.
>
> After rebooting (hard reboot), fsck proceeded to inform me that one of my
> partitions needed serious checking.
>
> Here is the assortment of errors syslog saved for me from the night
> before:
> ....
> Apr 18 02:48:48 seen kernel: scsi : aborting command due to timeout : pid
> 16668, scsi0, channel 0, id 1, lun 0 Write (6) 10 30 0a 02 00
> Apr 18 02:48:48 seen kernel: scsi : aborting command due to timeout : pid
> 16669, scsi0, channel 0, id 1, lun 0 Write (6) 10 64 cc 02 00
> Apr 18 02:48:48 seen kernel: scsi : aborting command due to timeout : pid
> 16670, scsi0, channel 0, id 1, lun 0 Write (6) 10 65 fa ec 00
> Apr 18 02:48:49 seen kernel: SCSI host 0 abort (pid 16669) timed out -
> resetting
> Apr 18 02:48:49 seen kernel: SCSI bus is being reset for host 0 channel 0.
> Apr 18 02:48:49 seen kernel: SCSI host 0 abort (pid 16670) timed out -
> resetting
> Apr 18 02:48:49 seen kernel: SCSI bus is being reset for host 0 channel 0.
> Apr 18 02:49:05 seen kernel: SCSI disk error : host 0 channel 0 id 1 lun 0
> return code = 26030000
> Apr 18 02:49:05 seen kernel: scsidisk I/O error: dev 08:12, sector 357882,
> absolute sector 1074682
> Apr 18 02:49:05 seen kernel: SCSI disk error : host 0 channel 0 id 1 lun 0
> return code = 26030000
> Apr 18 02:49:05 seen kernel: scsidisk I/O error: dev 08:12, sector 344074,
> absolute sector 1060874
> ....
>
> After about 12 minutes of this, the file system decided to join in feeling
> that it had been left out:
> ....
> Apr 18 03:00:57 seen kernel: EXT2-fs error (device 08:12):
> ext2_write_inode: unable to read inode block - inode=2059, block=8200
> ....
>
> The odd thing is that last week I had just scanned the drive for bad
> blocks and repartitioned it.
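
For reference, the numbers in the log above can be picked apart with a few
lines of code. This is only a rough sketch, and it rests on assumptions that
are not stated in the log itself: the conventional Linux SCSI disk numbering
(major 8, 16 minors per disk), and a result word laid out as driver byte,
host byte, message byte, status byte from high to low, as in the kernel's
scsi.h of this era. The function names are made up for illustration.

```python
def decode_dev(dev):
    """Turn a 'MM:mm' hex device string from the log into a device name.

    Assumes the usual sd layout: major 8, 16 minor numbers per disk.
    """
    major_hex, minor_hex = dev.split(":")
    major, minor = int(major_hex, 16), int(minor_hex, 16)
    if major != 8:
        raise ValueError("not a SCSI disk major number")
    disk = chr(ord("a") + minor // 16)   # 0 -> sda, 1 -> sdb, ...
    part = minor % 16                    # 0 means the whole disk
    name = "/dev/sd" + disk + (str(part) if part else "")
    return major, minor, name


def partition_start(relative_sector, absolute_sector):
    """Sector at which the partition begins on the disk."""
    return absolute_sector - relative_sector


def split_result(code_hex):
    """Split the 32-bit 'return code' into its four bytes (assumed layout)."""
    value = int(code_hex, 16)
    return {
        "driver": (value >> 24) & 0xFF,
        "host": (value >> 16) & 0xFF,
        "msg": (value >> 8) & 0xFF,
        "status": value & 0xFF,
    }


print(decode_dev("08:12"))
print(partition_start(357882, 1074682), partition_start(344074, 1060874))
print(split_result("26030000"))
```

Under those assumptions, 08:12 is the second partition of the second SCSI
disk, both I/O errors give the same partition start (so the two "absolute
sector" values are consistent with each other), and the host byte of 0x03 in
the return code would be a timeout rather than a medium error reported by the
drive, if I am reading the scsi.h constants correctly.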

Unfortunately, if you are using something like badblocks to find the bad
sectors, I've found that the set of sectors it reports varies somewhat from
run to run, depending on whether or not the drive manages to retry the
operation successfully before it runs out of retries. I would suggest using
the Adaptec SCSI BIOS disk utilities and doing a Verify Media operation. This
tends to be more reliable about mapping out bad sectors, so they are less
likely to come back and haunt you in the future, which is exactly what has
happened in this case.
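
If you do want to stay with badblocks, one way to cope with the run-to-run
variation is to run it several times and combine the results. The following
is a minimal sketch, assuming each run's output was saved as a list of block
numbers, one per line (the format badblocks writes with -o). The helper names
and the sample numbers are hypothetical, not taken from a real drive.

```python
def read_bad_blocks(lines):
    """Parse badblocks-style output: one block number per line."""
    return {int(line) for line in lines if line.strip()}


def merge_runs(runs):
    """Combine several runs.

    Returns (union, flaky): every block reported by any run, and the
    subset that was NOT reported by every run, i.e. the marginal blocks
    the drive sometimes managed to retry successfully.
    """
    if not runs:
        return [], []
    union = set().union(*runs)
    common = set.intersection(*runs)
    return sorted(union), sorted(union - common)


# Hypothetical output from three separate scans of the same partition.
run1 = read_bad_blocks(["8200", "8201", ""])
run2 = read_bad_blocks(["8200"])
run3 = read_bad_blocks(["8200", "9000"])

union, flaky = merge_runs([run1, run2, run3])
print("map out:", union)
print("seen only in some runs:", flaky)
```

The union is the conservative list to hand to the filesystem (mke2fs can read
such a list with -l, provided the block size matches the one badblocks used).
It still will not catch a sector that happens to pass on every run, which is
why I would trust the controller's Verify Media pass more.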

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
