Re: SCSI disk I/O error

Doug Ledford (dledford@dialnet.net)
Sun, 19 Apr 1998 12:17:33 -0500


Sean Farley wrote:
> I'll give it a whirl. I just need to move the files off of the drive.

That or you have to run e2fsck -c on the drive multiple times and write down
*every* block number that gives media errors and then try to map those into
their associated files and simply replace those files after you do the
Verify Media operation in the BIOS.
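
As a rough sketch of the block-to-file mapping step: assuming debugfs from
e2fsprogs is installed, its icheck request maps block numbers to inodes and
its ncheck request maps inodes to path names. The helper names, the device
/dev/sdb1 and the block and inode numbers below are made up for
illustration, and the block numbers must be filesystem block numbers as
e2fsck reports them. The script only builds the command lines; it does not
touch the disk.

```python
import shlex


def icheck_command(device, blocks):
    """Build a debugfs command that maps block numbers to inode numbers."""
    # Duplicates are dropped because the same block is often noted twice
    # across several e2fsck -c passes.
    request = "icheck " + " ".join(str(b) for b in sorted(set(blocks)))
    return ["debugfs", "-R", request, device]


def ncheck_command(device, inodes):
    """Build a debugfs command that maps inode numbers to path names."""
    request = "ncheck " + " ".join(str(i) for i in sorted(set(inodes)))
    return ["debugfs", "-R", request, device]


if __name__ == "__main__":
    # Hypothetical block numbers written down by hand from the e2fsck runs.
    bad_blocks = [81921, 8193, 81921]
    print(shlex.join(icheck_command("/dev/sdb1", bad_blocks)))
    # Hypothetical inode numbers read off the icheck output.
    print(shlex.join(ncheck_command("/dev/sdb1", [2049, 12])))
```

Run the icheck line first, read the inode numbers off its output, then feed
those to ncheck to get the file names to replace.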

> Why would a bad block make the whole system unstable? The drive in
> question is merely a data drive.

Data drive or not, it still has filesystem metadata on it. If the system
can't read that metadata then bad things happen (that was the cause of the
EXT2-FS PANIC: message you had). Additionally, the drive will retry bad
sectors multiple times. On occasion, I've seen drives think that they have
succeeded when they haven't, and then you get corrupted data from the drive
(maybe in the qps binary). Once that toasty data is cached, you have to
flush the cache and reload it before things have a chance of working
again. Note that this need not generate any error messages: if the
controller doesn't have to retry the command, and the drive instead managed
to *think* that it got things right, the drive completes the command with
bogus data and marks the SENSE information as a recovered error. In those
cases, you won't see any messages but "bad" things will happen.
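
To illustrate what "recovered error" looks like in the SENSE information,
here is a minimal sketch that pulls the sense key out of fixed-format SCSI
sense data (response code 0x70 or 0x71, sense key in the low four bits of
byte 2). The function names and the sample buffer are hypothetical; this is
not how any particular driver does it.

```python
# Sense key values from the SCSI fixed-format sense data layout.
SENSE_KEYS = {
    0x0: "NO SENSE",
    0x1: "RECOVERED ERROR",
    0x2: "NOT READY",
    0x3: "MEDIUM ERROR",
    0x4: "HARDWARE ERROR",
    0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION",
    0x7: "DATA PROTECT",
}


def sense_key_name(sense):
    """Return the sense key name from fixed-format sense data."""
    if len(sense) < 3:
        raise ValueError("sense buffer too short")
    # Bit 7 of byte 0 is the VALID bit, so mask it off before comparing.
    if (sense[0] & 0x7F) not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    key = sense[2] & 0x0F
    return SENSE_KEYS.get(key, "key 0x%X" % key)


def drive_claims_recovery(sense):
    """True when the drive says it recovered, so no error is reported."""
    return sense_key_name(sense) == "RECOVERED ERROR"


if __name__ == "__main__":
    # Made-up 18-byte buffer: current error, sense key 1, rest zeroed.
    sample = bytes([0x70, 0x00, 0x01]) + bytes(15)
    print(sense_key_name(sample), drive_claims_recovery(sample))
```

A command that comes back with this sense key counts as a success, which is
why nothing shows up in the logs even if the data was wrong.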

> When I tried to run qps, it failed to
> load three hours before the first SCSI error showed up. qps is on a
> different drive. At that time qps would not run, but I was able to start
> an xterm.

See above. I would think you just had a hosed copy of qps in your cache
(for whatever reason; it could be a memory scribble somewhere rather than a
bad drive read). In those cases, do something like a bonnie run with a file
size of at least your RAM size. That should flush the cache of unneeded
info, and when qps is reloaded it might work properly again.
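
If bonnie isn't handy, the same idea can be sketched in a few lines: write
a scratch file at least as big as RAM, read it back, and delete it, so that
older cached pages get pushed out. This is only an approximation of what a
bonnie run does, the function name is made up, and the size used in the
example is deliberately tiny; you would pass your real RAM size and a path
on a drive you trust.

```python
import os
import tempfile


def churn_page_cache(path, total_bytes, chunk_bytes=1 << 20):
    """Write total_bytes to path, read them back, then delete the file.

    Returns the number of bytes read back.
    """
    chunk = b"\0" * chunk_bytes
    written = 0
    with open(path, "wb") as f:
        while written < total_bytes:
            n = min(chunk_bytes, total_bytes - written)
            f.write(chunk[:n])
            written += n
        f.flush()
        os.fsync(f.fileno())  # push the data out to the disk

    read_back = 0
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_bytes)
            if not data:
                break
            read_back += len(data)

    os.remove(path)
    return read_back


if __name__ == "__main__":
    # Tiny demonstration size; use at least your RAM size for real.
    scratch = os.path.join(tempfile.gettempdir(), "cache_churn.tmp")
    print(churn_page_cache(scratch, 4 * 1024 * 1024))
```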

> If I do have a bad block, something in the kernel seemed to have
> overreacted. At least IMHO. :)

The kernel doesn't like filesystem errors, and understandably so. After
all, what do you do if you can't read a block of metadata and you need to
write just a few bits into that same block? Do you hose the whole block by
trying to continue, or do you panic that fs so that the machine can survive
until the next reboot and get things cleaned up?

-- 

Doug Ledford <dledford@dialnet.net> Opinions expressed are my own, but they should be everybody's.
