What are the possible causes of a temporary, partial system hang/freeze?
I've tried kernels from 2.2.10 to 2.2.16 with no luck stopping these
hangs/freezes. Specifically, the system (a web server):
* stops responding to most requests for up to one or two
minutes, then comes back to life;
* hangs only (as far as I can tell) when there is some
load on the server;
* stops responding to requests from the Apache daemon --
Apache waits and waits, then sends out an 'internal
error' message;
* responds to a few simple command line requests from
root (things like ls, cat, and file) but hangs when
a command like "ls -l" is used;
* continues to respond to queries made through an NFS
connection, including reading files from the server
(even though root at the exact same time is frozen
at the console);
* hangs equally on IDE and SCSI drives;
* generates no error messages on the console or in
the logs (I have remote logging to another server,
so it isn't a matter of not being able to write the
messages out to the drives).
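One way to narrow down why "ls" answers while "ls -l" hangs is to time the extra steps that "ls -l" performs: a stat of each entry, plus a lookup of the owner and group names. The following is only a rough sketch under that assumption; the helper names timed and probe_ls_l are made up for illustration, and it would need to be run during a freeze to tell anything.

```python
import grp
import os
import pwd
import time


def timed(label, func, *args):
    """Run func(*args); return (label, elapsed_seconds, result_or_exception)."""
    start = time.monotonic()
    try:
        result = func(*args)
    except (OSError, KeyError) as exc:
        # KeyError: the uid/gid has no name entry; OSError: filesystem trouble.
        result = exc
    return label, time.monotonic() - start, result


def probe_ls_l(path="."):
    """Time the steps that `ls -l` adds on top of a plain `ls`."""
    timings = []
    # A plain `ls` only needs the directory listing.
    label, secs, names = timed("listdir", os.listdir, path)
    timings.append((label, secs))
    if isinstance(names, Exception) or not names:
        return timings
    first = os.path.join(path, names[0])
    # `ls -l` additionally stats each entry...
    label, secs, st = timed("lstat", os.lstat, first)
    timings.append((label, secs))
    if isinstance(st, Exception):
        return timings
    # ...and maps the numeric owner and group to names.
    timings.append(timed("getpwuid", pwd.getpwuid, st.st_uid)[:2])
    timings.append(timed("getgrgid", grp.getgrgid, st.st_gid)[:2])
    return timings


if __name__ == "__main__":
    for label, secs in probe_ls_l("."):
        print(f"{label:10s} {secs:8.3f} s")
```

If one of the four timings jumps from milliseconds to many seconds during a freeze, that step is the place to look next.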
My first guess was that this had something to do with my SCSI Adaptec
2940U2W controller or my LVD disks (Seagate Cheetahs). However, I've
kicked away at that idea for three months, trying many solutions. It may
have something to do with running software RAID-1, but it hangs the same
when the system is in degraded mode with only one SCSI disk operational.
According to the developer/maintainer of the RAID-1 code, Mingo, it is
highly unlikely to be a problem specifically with the RAID code. And if I
can 'cat' a file during the freeze, then it isn't a disk system freeze
(how could I read from disk?).
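A stricter check than 'cat' is possible, since a read of a recently used file can be satisfied from the kernel's cache without touching the disk. A minimal sketch, assuming a small write forced out with fsync is a fair stand-in for real disk activity: the function names, the 4096-byte payload, and the two-second threshold are all arbitrary choices made here, not anything taken from the system above.

```python
import os
import time


def write_latency(path, payload=b"x" * 4096):
    """Write payload to path, force it to disk, and return the seconds taken."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # block until the kernel reports the data as written out
    finally:
        os.close(fd)
    return time.monotonic() - start


def monitor(path, samples, interval=1.0, threshold=2.0):
    """Take `samples` measurements; return (timestamp, seconds) for slow ones."""
    slow = []
    for _ in range(samples):
        elapsed = write_latency(path)
        if elapsed > threshold:
            slow.append((time.time(), elapsed))
        time.sleep(interval)
    return slow
```

Left running against a scratch file on the affected disk, monitor would show whether the stalls line up with slow synchronous writes; if it reports nothing during a freeze, the disk path is probably not the culprit.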
I suspect it is something to do with taking a Debian Potato system and
putting a vanilla kernel on it patched with the aic7xxx driver and the
RAID code. Many people are running these patched vanilla kernels on
Red Hat systems, so it seems this is something specific to Debian.
My system is:
ASUS P3B-F motherboard (Intel 440BX AGPset);
CPU Bus/PCI Freq - 100.3/33.43 MHz
Single PII 400MHz (Deschutes)
512MB 100MHz ECC RAM
Adaptec 2940U2W (2.20.0 BIOS) -- Tagged Command Queueing enabled; max 32
commands per device; reset delay 5 sec.
SCSI cable is 3' internal, teflon, custom made Ultra2-LVD with active
negation terminator
2 x Seagate ST39103LW (Ultra2-LVD drives)
NE2000 Clone (ISA)
I'd appreciate any suggestions. So far, the main things I've tried are:
* tested kernels 2.2.10 to 2.2.16 with the appropriate RAID
patches (no differences);
* replaced the original Adaptec cable and terminator with a custom-built,
high-end, teflon cable and terminator ($125US);
* set up remote logging to catch any error message (none logged);
* added verbose debugging (nothing);
* upgraded the AIC7xxx driver from 5.1.21 to 5.1.30 (performance
improved but it still hangs);
* removed the IDE drives and kernel support for IDE, based on
someone's hunch;
* lowered the front-side bus speed from 100.3MHz to 88.3MHz (underclocking
the CPU at the same time), based on another hunch;
* checked IRQs for conflicts -- none;
* compiled the kernel with and without tagged command queuing,
with more and fewer max commands;
* added hard disk fans;
* reduced the SCSI controller speed for the Seagate drives to 40MB/s
from 80MB/s.
Again, any suggestions appreciated.
Thanks in advance,
Jeff Hill
--
HR On-Line: The Network for Workplace Issues
http://www.hronline.com - Ph:416-604-7251 - Fax:416-604-4708
This archive was generated by hypermail 2b29 : Thu Jun 15 2000 - 21:00:27 EST