Re: ext2 filesystem corruption?!?!?? (fwd)

Doug Ledford (
Fri, 04 Apr 1997 06:28:04 -0600


> This is certainly true. There are some very large sites running Linux
> who don't seem to be having any problems. It would be good to know what
> type of controllers are being used by the people having problems. It
> would also be extremely helpful to have a reproducible way of creating
> this problem that was simpler than running news for a few days.

I'll go one step further with this. I would recommend that the people having
problems with ext2fs corruption run the following test (if possible):

Let's say you have a hard drive partition of decent size whose data you don't
mind losing. (Even if you do mind, this test can turn up a lot of errors, so if
you have any way of getting the data back, however inconvenient, you should
probably do this anyway.) The write test below destroys whatever is on the
partition.

First, get the exact size of the partition (or the whole drive as the case may
be in some circumstances) in 1K blocks.
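
One way to get that number, as a rough sketch: on Linux, /proc/partitions
lists each device with its size in 1K blocks in the third column. The helper
name blocks_1k and the partition name sda3 are made up for illustration.

```shell
#!/bin/sh
# Print the size of one device in 1K blocks, reading /proc/partitions-style
# text on stdin (columns: major minor #blocks name).
# blocks_1k is a hypothetical helper name, not an existing tool.
blocks_1k() {
    awk -v name="$1" '$4 == name { print $3 }'
}

# Example (sda3 is a hypothetical partition name):
#   blocks_1k sda3 < /proc/partitions
```

If the name is not listed, the function prints nothing, so check for empty
output before using the result.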

Divide this total number of blocks into 4 equal chunks (most sizes divide
evenly; some may leave an odd-sized chunk).
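
The arithmetic can be sketched in shell as follows. This is only an
illustration: chunk_bounds is a made-up name, blocks are assumed to be numbered
from 0, each chunk's last block is treated as inclusive so that ranges do not
overlap, and the final chunk absorbs any remainder. Those are choices of this
sketch.

```shell
#!/bin/sh
# chunk_bounds TOTAL_BLOCKS [CHUNKS]   (hypothetical helper, default 4 chunks)
# Print "chunk first-block last-block" for each chunk. Ranges do not overlap,
# and the last chunk absorbs any remainder from the integer division.
chunk_bounds() {
    total=$1
    chunks=${2:-4}
    size=$((total / chunks))
    i=0
    while [ "$i" -lt "$chunks" ]; do
        first=$((i * size))
        if [ "$i" -eq $((chunks - 1)) ]; then
            last=$((total - 1))
        else
            last=$((first + size - 1))
        fi
        echo "$((i + 1)) $first $last"
        i=$((i + 1))
    done
}
```

Each output line gives a chunk number, its first block, and its last block.
Passing 8 as the second argument gives eight chunks instead of four.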

Write a script like this:

badblocks -w -s -b 1024 -o /tmp/list.1 /dev/??? (blocks * .25) 0 &
badblocks -w -s -b 1024 -o /tmp/list.2 /dev/??? (blocks * .5) (blocks * .25) &
badblocks -w -s -b 1024 -o /tmp/list.3 /dev/??? (blocks * .75) (blocks * .5) &
badblocks -w -s -b 1024 -o /tmp/list.4 /dev/??? (blocks) (blocks * .75) &

A simple shell script like this will run four simultaneous badblocks programs
on the drive. You can then check the files in the /tmp directory to see
if any blocks were reported as bad. With modern IDE or SCSI drives, all of these
files should have a zero length unless one of two things is true. One, you
have a drive developing too many bad sectors to be mapped out (which is cause
for alarm in itself) or two, you have corruption in your low level driver (or
other low level hardware such as memory or cache or bus transfer problems).
If these tests return all zero-length files, then we should start looking
elsewhere for the problem. Run the test several times, as a single pass may not
show the problem. If you are really courageous, you can try doubling the
tests by splitting the drive into 8 equal chunks (or if you have two drives
you can do both drives at four chunks each at the same time). This is a
standard test I use with the aic7xxx driver to find problems with tagged
queueing and high commands-per-LUN values. It seems to show problems much
quicker than any filesystem activity would. (In my case, I had as many as 24 of
these running simultaneously on 6 drives in order to test this out. Talk about
a dog-slow machine: it took about 5 minutes just to start X Windows under this
load.)
In any case, running tests like these to rule out hardware corruption would
help greatly in increasing the level of confidence that somehow the ext2fs
layer is at fault (which I personally don't think it is, except on very rare
occasions, since I have a heavily loaded news server running that filesystem
without problems, but I've taken the care and gone to the lengths to run these
tests on
the particular hardware in that machine and identified bad combinations that
can cause problems and worked around them at the driver level).
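
Checking the output files after each pass can be scripted too. The following
is a minimal sketch, not part of the test itself: check_lists is a made-up
helper that reports every list file with a non-zero length and returns
non-zero if it finds one.

```shell
#!/bin/sh
# check_lists FILE...   (hypothetical helper)
# Report each file that is non-empty, i.e. where badblocks listed at least
# one block. Returns 0 if every file is empty or missing, 1 otherwise.
check_lists() {
    bad=0
    for f in "$@"; do
        if [ -s "$f" ]; then
            n=$(grep -c '' "$f")    # count lines, one block number per line
            echo "$f: $n bad block(s) reported"
            bad=1
        fi
    done
    return "$bad"
}

# Example, once all four badblocks runs have finished:
#   check_lists /tmp/list.1 /tmp/list.2 /tmp/list.3 /tmp/list.4
```

Since a single pass may not show the problem, run the check after every pass
rather than only at the end.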

* Doug Ledford                      *   Unix, Novell, Dos, Windows 3.x,     *
*    873-DIAL  *     WfW, Windows 95 & NT Technician   *
*   PPP access $14.95/month         *****************************************
*   Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
*   communities.  Sign-up online at * Web page creation and hosting, other  *
*   873-9000 V.34                   * services available, call for info.    *