Re: ext2 filesystem corruption?!?!?? (fwd)

Doug Ledford (dledford@dialnet.net)
Fri, 04 Apr 1997 23:24:37 -0600


--------

> | A return of 0 proves its neither hardware or ext2, but a failure does not
> | indicate anything.
>
> I think a return of 0 proves that the hardware/driver/cabling/etc works
> okay. badblocks doesn't go through the ext2fs filesystem, though, so it
> doesn't prove that the filesystem driver is bug-free. If you get errors,
> on the other hand, it can't be from ext2fs, so it must be one of the
> hardware/driver/cabling/etc things involved.

Correct. And it's very useful information to have at that. If you can
produce corruption problems without going through the ext2fs code, then you
have hardware corruption of some sort. Some examples of things that I have
personally seen cause hardware corruption in the past, which made one
*THINK* that something was wrong with the ext2fs code when there wasn't:

1. Bad CPU fans on pentium and high speed 486 machines
2. Bad SCSI cables
3. Memory timing settings in BIOS being just a tad too aggressive
4. Bad memory
5. Bad Pipeline Burst (or other) cache
6. Too long of a SCSI or IDE cable
7. Interference between SCSI and IDE cables running in close proximity to
each other
8. Flaky CPU (had been overclocked and partially burnt out)
9. Esoteric BIOS options being enabled when they shouldn't be (this takes
some experimentation to find and fix: change the BIOS settings, test to
see if the problem is gone, and if not, reboot and change the settings
again)

These are a few examples. A second thing to keep in mind is that the ext2fs
is a rather fast filesystem by unix standards (it beats the hell out of the
EAFS HTFS DTFS etc filesystems from SCO, but who's comparing SCO to linux
anyway :) so if you have hardware corruption problems that don't show up
except under heavy load, ext2fs is a good filesystem to bring those out :)

And of course, the very reason I posted my original email as part of this
thread: a person needs to always keep in mind that if they are getting ext2fs
errors about corruption, this does *NOT* always mean the ext2fs is at fault.
It means that somewhere along the way, either due to code in the ext2fs, or
code in the block driver you are using, or code in the low level driver you
are using, or somewhere between the CPU, RAM, cache, bus, controller, drive
bus, drive, and magnetic media, something is getting corrupted. It is
important in these cases to try and isolate software faults from hardware
faults. The purpose of the "script" I posted was to give a convenient way of
trying to narrow down the line between hardware and software. There is still
software involved with that script, but not as much. You are down to just the
badblocks program, the various buffer mechanisms, and the block driver itself
(with its underlying low level driver). Generally speaking, the buffer cache
is considered to be safe code, so you can rule that out. Most of the block
drivers are considered to be the same, so they can be ruled out. This leaves
the underlying low level driver and the badblocks program as suspect. The
badblocks program is rather simple in design, and an inspection of the source
will result in the conclusion that it too can be ruled out (not to mention how
many times it's been used to find these problems, yet I've never once heard of
it causing sectors that are fine to be mapped as bad unless the underlying
driver had problems). That means that the script I posted is really stressing
hardware and your underlying low level driver. All in all, that greatly
reduces the number of variables to look at. So, a failure during the testing
by the badblocks program gives a person somewhere to look. They can either
fiddle with compile options for their low level driver, or they can start the
process of trying to enable/disable things in the computer's BIOS to try and
find a culprit (disable cache this run, slow the memory timings that run, etc)
which then allows a person to try and pinpoint the exact problem, get it
fixed, and be on their way :) Further, as long as you fail this test, there
is no sense at all in even looking at the ext2fs code since you won't know if
you've fixed anything by changing it unless something you did just happened to
slow things down enough to keep the problem from showing up. In this case,
instead of slowing the machine down to be reliable and leaving fast code in
place, you've slowed the code down so it doesn't break your faulty hardware.
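
To make the idea concrete, here is a minimal sketch in Python of the kind of
write-then-read-back check a destructive pattern test performs. This is not
the script referred to above, and it is not how badblocks itself is
implemented; the function name count_bad_blocks, the block size, the block
count, and the patterns are all assumptions chosen for illustration. Note that
it overwrites whatever is at the given path, and that it goes through the
normal buffer cache, so a read may be served from RAM rather than from the
media.

```python
import os


def count_bad_blocks(path, block_size=4096, num_blocks=256,
                     patterns=(0xAA, 0x55, 0xFF, 0x00)):
    """Write each pattern over the first num_blocks blocks of path, read
    the blocks back, and return a sorted list of block indices that did
    not match on at least one pass.

    WARNING: destructive. Everything in the tested range is overwritten.
    The path must already exist (a scratch file or a scratch device).
    """
    bad = set()
    for pattern in patterns:
        expected = bytes([pattern]) * block_size

        # Write pass: fill every block with the current pattern and ask
        # the kernel to push the data out before reading it back.
        with open(path, "r+b") as target:
            for _ in range(num_blocks):
                target.write(expected)
            target.flush()
            os.fsync(target.fileno())

        # Read pass: any block that differs from what was written is
        # recorded. This still passes through the buffer cache.
        with open(path, "rb") as target:
            for index in range(num_blocks):
                if target.read(block_size) != expected:
                    bad.add(index)
    return sorted(bad)
```

An empty list means every block read back exactly as written. Pointed at a
scratch partition whose contents you can afford to lose, a non-empty list
would implicate the hardware or the low level driver rather than ext2fs, since
no filesystem code is involved.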

Now, having said all of that :) to ejt@bigband.ior.com:

I looked at the hardware setup you posted, and in my experience, the BusLogic
driver is as stable and error free as they come, as is the card itself.
However, if you are able to reproduce these problems with the badblocks test,
the first place I would look if I were you is into the cache setup and the
setup of the PCI bus in the BIOS to see if you can fix problems there. I
would also like to point out to people on this list, that I have in the past
seen computers fail, more than once, with memory problems in main RAM even
when DOS based 16 bit memory testers passed the RAM. I haven't used
memtest86, so I don't know if it falls in this category, but if it does, I
wouldn't put too much faith in its ability to find bad RAM. Instead I would do
a series of make -j zlilo compiles, which tend to find faulty RAM better
than most DOS based memory testers (of course, you have to have a lot of RAM
to do a blind -j compile, otherwise you need to specify a maximum number of
parallel compiles to run, and I've seen problems with parallel compiles if one
of the compile targets include the NCR scsi driver because of the way it's
compiled and some magic ln/rm commands done during the make process that
clobber each other when done in parallel, but my experience with that is
compiling the NCR driver as a module).
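
The "series of compiles" idea can be sketched as a small Python helper that
runs the same command several times and reports which runs failed. The name
run_repeatedly and the default run count are assumptions for illustration
only; the helper knows nothing about make and simply treats any non-zero exit
status as a failure.

```python
import subprocess


def run_repeatedly(command, runs=5):
    """Run command (a list of arguments) the given number of times and
    return a list of (run_number, return_code) for every run that exited
    with a non-zero status. An empty list means every run succeeded.
    """
    failures = []
    for run_number in range(1, runs + 1):
        # Output is discarded here; only the exit status is inspected.
        result = subprocess.run(command,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL)
        if result.returncode != 0:
            failures.append((run_number, result.returncode))
    return failures
```

From the top of a kernel source tree, something like
run_repeatedly(["make", "-j4", "zlilo"]) would cap the build at four parallel
jobs (the 4 is an arbitrary example), though the tree would need to be cleaned
between runs for each one to do real work. A build that fails on some runs and
not others, with no source changes in between, points at hardware rather than
at the code being compiled.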

Does anyone else here think that maybe this thread ought to be saved and
turned into an ext2fs_corruption FAQ in the linux documentation? It seems
like every so often this thread pops up with similar results. Maybe one line
of code gets changed here or there (sometimes), but usually, the person in
question has some hardware problems causing the grief.

-- 
*****************************************************************************
* Doug Ledford                      *   Unix, Novell, Dos, Windows 3.x,     *
* dledford@dialnet.net    873-DIAL  *     WfW, Windows 95 & NT Technician   *
*   PPP access $14.95/month         *****************************************
*   Springfield, MO and surrounding * Usenet news, e-mail and shell account.*
*   communities.  Sign-up online at * Web page creation and hosting, other  *
*   873-9000 V.34                   * services available, call for info.    *
*****************************************************************************