SCSI problem and VFS question.

J. Sean Connell (ankh@canuck.gen.nz)
Thu, 9 Jan 1997 16:21:39 +1300 (NZDT)


We have a Linux-based news server here, which for various reasons is shortly
going to be shoved onto an Axil 320 (running Solaris -- *puke*). We have two
problems, actually. The first one is that periodically, the SCSI bus will get
itself into some kind of weird state where all the aic7xxx driver is doing is
issuing resets. The drives appear to be acting on the reset requests, but
either the drives aren't acknowledging the resets, or the driver isn't seeing
the acknowledgements. The card is an Adaptec 2940UW, ROM revision 1.23.
/proc/scsi/scsi says:

----CUT HERE----
Attached devices:
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: SEAGATE Model: ST32550N Rev: 0021
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: SEAGATE Model: ST32550N Rev: 0021
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: SEAGATE Model: ST32550N Rev: 0019
Type: Direct-Access ANSI SCSI revision: 02
Host: scsi0 Channel: 00 Id: 06 Lun: 00
Vendor: SEAGATE Model: ST32550N Rev: 0015
Type: Direct-Access ANSI SCSI revision: 02
----CUT HERE----

The news spool is an MD raid0 device on IDs 5 and 6 (which I'll get to
shortly). Back before the UW cage blew up and took the two 4GB Barracudas
with it, we had 0 and 1 on a 2940N controller, and this also experienced the
reset loop phenomenon. Drives 0 and 1 are brand-new, less than three weeks
old. This happened originally under 2.0.25, and upon advice from a friend who
also runs a news server on similar hardware, we downgraded to 2.0.12, which
also suffers from the same problem. (The box itself is a P166 with 128M RAM,
512M swap, and an Intel 82371SB Natoma/Triton II motherboard, according to
/proc/pci.)

The second problem is drives 5 and 6. They are the temporary replacement
drives for the two 4GB UW Barracudas that got fried by a wonky power supply.
They were also the old spool disks out of the old news server (an SS2 running
SunOS 4.1.3_U1), and unfortunately (this is probably 90% of our problem at
the moment) they have a few bad blocks.

SunOS and Solaris both seem to have some kind of recovery behavior for when
they try to read an inode but get a bad block instead. Under Linux, however,
processes start getting stuck in wait_on_inode in uninterruptible sleep,
including init, which isn't so hot. I had a look at both the ext2 code and
the VFS code, hoping I could change the panic into "return -EIO", only to
discover that not only is the ext2 code unable to do that, but the VFS layer
doesn't appear to be able to handle it either.
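
To make it concrete (this is just an illustration on my part, not anything
that exists today, and the article path is made up), here's roughly what I'd
want a process to see if the kernel handed back EIO instead of sleeping in
wait_on_inode:

----CUT HERE----
/*
 * Illustrative only: what a news-reading process would see if the
 * kernel returned -EIO for the bad block instead of blocking the
 * caller in wait_on_inode.  The article path below is made up.
 */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    ssize_t n;
    int fd = open("/var/spool/news/alt/test/1234", O_RDONLY);

    if (fd < 0) {
        perror("open");
        return 1;
    }
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;   /* consume the article */
    if (n < 0 && errno == EIO)
        /* The failure mode I'd like: one article lost, not the box. */
        fprintf(stderr, "read: %s -- skipping article\n", strerror(errno));
    else if (n < 0)
        perror("read");
    close(fd);
    return n < 0 ? 1 : 0;
}
----CUT HERE----

That way the news server loses the one article sitting on the bad block
instead of the whole box wedging.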

Thus, my question is: is anybody thinking about or planning to extend or
enhance the VFS layer so that the "errors=continue" ext2fs mount option could
become global?
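
(For comparison, the per-filesystem knob already exists for ext2: something
like "mount -t ext2 -o errors=continue /dev/md0 /var/spool/news" -- the
device and mount point there are made up -- tells ext2 to carry on after it
hits an error rather than remounting read-only or panicking. What I'm after
is that same "fail the one request and carry on" behavior available from the
VFS layer for any filesystem.)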

EIO may be the wrong errno; I'm not up on my POSIX, and I can't figure out
what SunOS/Solaris do. If it is wrong, just replace it with whatever's
appropriate.

If anyone can shed any light on the problems we're having with SCSI, I would
sincerely appreciate it. Feel free to email me if you need any further info
and/or output from /proc.

--
Jeffrey Connell            | Systems Administrator, ICONZ
ankh@canuck.gen.nz         | Any opinions stated above are not my employers',
ankh@iconz.co.nz           | not my boyfriend's, my priest's, my God's,
#include <stddisc.h>       | my friends', and probably not even my own.
---------------------------+--------------------------------------------------
Fingerprint: 1024/2B8B116D | Key at http://www.canuck.gen.nz/~ankh/pgpkey.html