Re: Serious ext2fs problem

Stephen C. Tweedie (sct@dcs.ed.ac.uk)
Wed, 1 Jan 1997 14:36:25 GMT


Hi,

On Tue, 31 Dec 1996 13:21:15 -0600 (CST), Chris Adams <cadams@ro.com>
said:

> Once upon a time, Stephen C. Tweedie wrote
>> In article <199612311454.IAA18638@sh1.ro.com> Chris Adams
>> <cadams@ro.com> writes:
>> > In my case, the filesystem error in question is
>>
>> > EXT2-fs error (device 08:31): ext2_new_block: Free blocks count
>> > corrupted for block group 546
>>
>> > This is on a 10 gig RAID 0 news spool (on a DPT 3224). The system is
>> > running Linux 2.0.27 + the noatime patch and INN 1.4unoff4. When I run
>> > e2fsck (1.04 that comes with RedHat 4.0 and 1.06 from tsx-11), no errors
>> > are reported.
>>
>> Bad SCSI bus.
>>
>> This is almost certainly bad hardware, ...

> Once the errors start happening, they keep happening at the same place
> over and over again. However, if I stop INN, umount the drive, mount
> the drive, and start INN, everything is happy (I had it giving me the
> above error twice every time I tried to tell INN to "go"). Since I
> umounted the drive and remounted it, it has been running just fine for
> ~24 hours. That doesn't seem like hardware to me.

It sounds just like the hardware problems I have seen before. The
trouble is that if a block is corrupted during a read, then — being a
bitmap block — the bad copy is likely to persist in the buffer cache.
If e2fsck finds no error later on, that just means the bad bitmap was
never written back to disk, which in turn means the kernel never
tried to reallocate any of the blocks from the particular block group
in question.

> I already have three 1 gig drives on an Adaptec 2940 (one for OS, swap,
> and INN; one for history; and one for overviews). The news spool is
> five 2 gig drives in a RAID 0 array on a DPT 3224. I have also had the
> problem on the 1 gig drives, although not in the last week or two.
> These are all drives, cables, and adapters that have worked fine for six
> months. And yes, the case is well cooled (we have a thermometer on it
> that I check every day or two), so I don't think the drives are all
> going bad from heat at the same time.

Bad drives would normally show up as CRC errors, not silent data
corruption. This still sounds more like a hardware problem, though:
are you able to enable parity checking on your SCSI buses? On the
other hand, if you can track down the problem to a specific kernel
upgrade, that might help to identify a software problem.

Cheers,
Stephen.