LOTS OF BAD STUFF in raid0: raid0145-19990824-2.2.11 is unstable

David Mansfield (david@cobite.com)
Fri, 5 Nov 1999 12:49:07 -0500 (EST)


(system details at bottom, summary 2xPII 450, 2.2.13pre14+raid0145latest)

I am STILL having the same old bug in the raid code/kernel that has
existed for about 6 months, at least. It no longer oopses, thanks to the
new debug code that checks for this (which I suggested, BTW, and which
others have triggered as well). But nonetheless this is really, really
bad. Does anyone have any ideas?

* I am willing to try out test patches to track this down, as I seem to
have a way to reproduce it *

In the past this hit me about once a month. I have been POUNDING the raid
for the last 24 hours (with real work, not benchmarking) and have gotten
all of these since then:

raid0_map bug: hash->zone0==NULL for block 808464440
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 808464440
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 808464440
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 171521844
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 171521844
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 171521844
Bad md_map in ll_rw_block
EXT2-fs error (device md(9,0)): ext2_free_blocks: Freeing blocks not in
datazone - block = 171521844, count = 1
raid0_map bug: hash->zone0==NULL for block 959524912
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 959524912
Bad md_map in ll_rw_block
raid0_map bug: hash->zone0==NULL for block 959524912
Bad md_map in ll_rw_block

Note: there is a new one thrown in. I have never seen the EXT2-fs error
before.
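
For what it's worth, the block numbers in the messages above look odd when
viewed as raw bytes rather than as integers. Here is a minimal sketch
(Python, nothing to do with the kernel code itself) that packs each logged
block number as an unsigned 32-bit little-endian value, which is an
assumption based on this being an x86 box, and prints the resulting bytes.
The helper name as_bytes is made up for illustration.

```python
import struct

# Block numbers copied from the raid0_map messages above.
blocks = [808464440, 171521844, 959524912]

def as_bytes(block):
    # Assumption: 32-bit block numbers stored little-endian (x86).
    return struct.pack("<I", block)

for b in blocks:
    # Show decimal, hex, and the four raw bytes side by side.
    print(f"{b:>10}  0x{b:08x}  {as_bytes(b)!r}")
```

All three values come out as ASCII digits (plus one newline): "8000",
"479\n" and "0019". That may hint that file data is ending up somewhere a
block pointer is expected, but this is only a guess, not a diagnosis.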

Here is my current setup:

2.2.13pre14 + raid0145-19990824-2.2.11. I know that pre14 should be
considered less 'stable' than 2.2.13 proper, but I follow linux-kernel and
that was the latest at the time, and the diff between pre14 and final had
(seemingly) nothing that could be related to this. Not to mention this
SAME BUG has been happening for many, many months, since 2.2.2 at least.
If you think this is the problem I will upgrade to 2.2.13 ASAP, but I
really don't think this is the cause: this is a chronic problem.

Hardware: Dual PII 450, DAC960 hardware raid card (completely separate
from the raid in question). The raid0 is built on 6 Seagate Cheetah 9 GB
drives connected to an Adaptec 2940U2W controller. The system hardware is
all from VA Research (more reliable??). Also in the system: an IDE CD-ROM
(not accessed since last reboot). 1 GB RAM.
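
As a rough sanity check of the logged block numbers against the size of
this array, here is a small sketch. It assumes each drive contributes about
9 * 10**9 bytes and that block numbers count 1 KB blocks; neither figure is
measured, and the name out_of_range is made up for illustration.

```python
# Assumptions (not measured): roughly 9 * 10**9 bytes per drive,
# and block numbers counted in 1 KB units.
DRIVES = 6
BYTES_PER_DRIVE = 9 * 10**9
BLOCK_SIZE = 1024

# Approximate number of blocks the whole raid0 could hold.
total_blocks = DRIVES * BYTES_PER_DRIVE // BLOCK_SIZE

# Block numbers from the raid0_map messages.
logged = [808464440, 171521844, 959524912]

def out_of_range(block, limit=total_blocks):
    # A valid block number must be smaller than the device size in blocks.
    return block >= limit

for b in logged:
    print(b, "out of range" if out_of_range(b) else "in range")
```

Under these assumptions the array holds roughly 52.7 million blocks, so
every logged block number is well past the end of the device, which would
be consistent with the "not in datazone" complaint from ext2.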

OS software is RedHat 5.1 with upgrade patches to 5.2. Compiler is gcc
2.7.2.3.

Application software is a multithreaded sort program (in-house code) that
is sorting 4 million records in a fixed-length flat file. It is VERY I/O
and CPU intensive: 100% CPU utilization (on both CPUs) and about 40 MB per
second of I/O. At 40 MB per second it may not be stable, but it sure is
FAST! Also running concurrently was a gzip of one of the output files of
this process.

David

-- 
/==============================\
| David Mansfield              |
| david@cobite.com             |
\==============================/
