1) Both ends of the bus are now properly
terminated.
2) I've taken the CD ROM off and just used
both disk drives and the tape.
3) I can recompile kernels with no problem
and haven't had any other hint of faulty
hardware, so intermittent hardware seems
somewhat unlikely.
4) Parity is being checked everywhere.
5) I suppose it's possible that the addition of
the tape is overtaxing the power supply,
but I sort of doubt that a 220W supply
would have a problem keeping up with
all of this.
The whole thing does stink of some sort of kernel
bug. The general symptom is that while backing up
(find+cpio), the disk is getting torched. It's not
always consistent, but the mode I've seen most
seems to be where make_request() in ll_rw_blk.c
notices that the block being requested is beyond
the end of the drive. Generally, they are
completely bogus
numbers. Given that ll_rw_block() is called all
over the place, it's not clear who it is that's
calling it with the bogus block number. However,
one interesting thing is that the buffer_head
being supplied seems sane in other respects --
it's just bh->b_rsector and bh->b_blocknr which are
toasted.
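
A minimal sketch of the kind of range check described
above, written as a user-space mock rather than real
kernel code. The names struct fake_bh and
request_in_range(), and the 512-byte sector size, are
assumptions made for illustration only;
__builtin_return_address() is a GCC extension. The idea
is to log who supplied the bogus block number at the
point where it is detected.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the kernel's buffer_head; only the
 * fields relevant to the range check are modelled here. */
struct fake_bh {
    unsigned long b_blocknr;  /* block number within the device */
    unsigned long b_size;     /* block size in bytes */
};

/* Returns 1 if the block lies on a device of dev_sectors 512-byte
 * sectors, 0 otherwise. On failure the caller's return address is
 * logged so it can be matched against the symbol map. */
int request_in_range(const struct fake_bh *bh, unsigned long dev_sectors)
{
    unsigned long sectors_per_block, max_blocks;

    assert(bh != NULL);
    sectors_per_block = bh->b_size / 512;
    if (sectors_per_block == 0)
        return 0;             /* a block size under 512 is itself bogus */

    /* Compare in units of blocks rather than multiplying b_blocknr,
     * so a garbage block number cannot overflow the arithmetic. */
    max_blocks = dev_sectors / sectors_per_block;
    if (bh->b_blocknr >= max_blocks) {
        fprintf(stderr, "bogus block %lu (device holds %lu), caller %p\n",
                bh->b_blocknr, max_blocks, __builtin_return_address(0));
        return 0;
    }
    return 1;
}
```

The logged address only identifies the immediate caller,
so in practice the check would have to sit close to the
places that fill in the buffer_head to be of much use.
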
I've also caught complaints about directory
entries in the kernel messages. I've fsck'ed my
disk several times (many out of necessity), and
although I've found that there are holes in fsck
(files with huge sizes which can't be rm'd and can
only be killed with debugfs), my file system
*seems* to be intact now. Yet I get errors when I
back up to the tape, but not when I send the
archive to /dev/null.
To me, it almost seems that SCSI is once in a
while getting mixed up -- that a write destined
for the tape is getting confused with a write
destined for the disk. Maybe it's more generic
than that, but writing the wrong data to the disk
would certainly cause a lot of hardship along the
lines of what I'm seeing. Though less frequently,
I'm also seeing other processes get toasted when
I'm doing my backups: this could indicate that
data being paged in is getting hosed.
The problem is that I'm not quite sure how to
proceed from here. I've looked over the tape
driver, and it's pretty simple-minded:
"statically" allocated buffers, no tricky double
buffering schemes, seemingly standard calls to the
scsi code. The code doesn't seem to be
accidentally overwriting anything that I can tell,
and the fact that it's specific elements of
structures that are getting nuked makes it less
likely that it's just an adjacent-memory kind of
issue. I've also turned off some of the tape
driver's options, such as write-behind, with
the same result.
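
One way to test the adjacent-memory theory directly,
rather than argue it from the symptoms, would be to
bracket a statically allocated buffer with guard words
and check them after each transfer. This is only a
user-space sketch: struct guarded_buf, GUARD_WORD,
TAPE_BUF_SIZE and both helper functions are hypothetical
names, not anything taken from the tape driver.

```c
#include <assert.h>
#include <string.h>

#define GUARD_WORD    0xDEADBEEFUL  /* arbitrary, easy to spot in a dump */
#define TAPE_BUF_SIZE 4096          /* hypothetical buffer size */

/* A static buffer with a known pattern on either side of it. */
struct guarded_buf {
    unsigned long head_guard;
    unsigned char data[TAPE_BUF_SIZE];
    unsigned long tail_guard;
};

/* Arm both guard words and clear the payload. */
void guard_init(struct guarded_buf *g)
{
    assert(g != NULL);
    g->head_guard = GUARD_WORD;
    g->tail_guard = GUARD_WORD;
    memset(g->data, 0, sizeof g->data);
}

/* Returns 1 while both guard words still hold the pattern,
 * 0 once either has been overwritten. */
int guard_intact(const struct guarded_buf *g)
{
    assert(g != NULL);
    return g->head_guard == GUARD_WORD && g->tail_guard == GUARD_WORD;
}
```

If the guards stay intact while the disk still gets
torched, an overrun out of that buffer becomes an even
less likely explanation.
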
Are there any other hints that the SCSI code or
anything in that path has been having similar
problems? Does anybody have a suggestion as to
what sort of debugging code I could add to try
to isolate the general vicinity of the problem?
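
As one possible shape for such debugging code, here is a
rough sketch of a small ring buffer that remembers the
most recent block requests, so that the history leading
up to a bogus one can be dumped when it is detected.
Everything here is hypothetical and user-space only:
trace_record(), trace_snapshot(), struct trace_entry and
TRACE_SLOTS are illustrative names, not existing kernel
interfaces.

```c
#include <assert.h>
#include <stddef.h>

#define TRACE_SLOTS 64  /* power of two, so the index can be masked */

struct trace_entry {
    unsigned int  dev;     /* device the request was aimed at */
    unsigned long block;   /* block number that was requested */
    const void   *caller;  /* whoever issued the request */
};

static struct trace_entry trace_ring[TRACE_SLOTS];
static unsigned long trace_next;  /* total number of entries recorded */

/* Remember one request, overwriting the oldest slot once full. */
void trace_record(unsigned int dev, unsigned long block, const void *caller)
{
    struct trace_entry *e = &trace_ring[trace_next & (TRACE_SLOTS - 1)];

    e->dev = dev;
    e->block = block;
    e->caller = caller;
    trace_next++;
}

/* Copy up to max of the most recent entries into out, oldest first.
 * Returns the number of entries copied. */
unsigned long trace_snapshot(struct trace_entry *out, unsigned long max)
{
    unsigned long have = trace_next < TRACE_SLOTS ? trace_next : TRACE_SLOTS;
    unsigned long n = have < max ? have : max;
    unsigned long start = trace_next - n;
    unsigned long i;

    assert(out != NULL);
    for (i = 0; i < n; i++)
        out[i] = trace_ring[(start + i) & (TRACE_SLOTS - 1)];
    return n;
}
```

Recording both disk and tape requests in the same ring
would show whether the bad disk requests tend to follow
tape writes.
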
-- Michael Thomas (mike@mtcc.com http://www.mtcc.com/~mike/) "I dunno, that's an awful lot of money." Beavis