Kernel Hangs Temporarily, Partially: Possible Causes?

From: Jeff Hill (jhill@hronline.com)
Date: Mon Jun 12 2000 - 23:15:56 EST


What are the possible causes of a temporary, partial system hang/freeze?

I've tried kernels from 2.2.10 to 2.2.16 with no luck stopping these
hangs/freezes. Specifically, the system (a web server):

* stops responding to most requests for up to one or two
  minutes, then comes back to life;

* hangs only (as far as I can tell) when there is some
  load on the server;

* stops responding to requests from the Apache daemon --
  Apache waits and waits, then sends out an 'internal
  error' message;

* responds to a few simple command line requests from
  root (things like ls, cat, and file) but hangs when
  a command like "ls -l" is used;

* continues to respond to queries made through an NFS
  connection, including reading files from the server
  (even though root at the exact same time is frozen
   at the console);

* the 'hangs' occur equally on IDE and SCSI drives;

* no error messages are generated to the console or to
  the logs (I have remote logging to another server,
  so it isn't a matter of not being able to write the
  messages out to the drives).
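
One way the difference between "ls" and "ls -l" could be narrowed down is to time the steps separately. Plain "ls" only needs the directory entries, while "ls -l" also stats every entry and resolves owner and group names, and that name resolution may involve a network service depending on how the system is configured. The following is a minimal sketch, not something from the original report; it assumes Python is available on the server, and the function name time_ls_steps is hypothetical.

```python
import grp
import os
import pwd
import time


def time_ls_steps(path="."):
    """Time the separate steps that `ls -l` performs on a directory.

    Returns a dict mapping step name to elapsed seconds.
    """
    timings = {}

    # Step 1: read the directory entries (all that plain `ls` needs).
    start = time.monotonic()
    names = os.listdir(path)
    timings["readdir"] = time.monotonic() - start

    # Step 2: stat every entry, as `ls -l` does to get sizes and owners.
    start = time.monotonic()
    stats = [os.lstat(os.path.join(path, name)) for name in names]
    timings["lstat"] = time.monotonic() - start

    # Step 3: turn numeric uids/gids into names, as `ls -l` does.
    start = time.monotonic()
    for uid in {s.st_uid for s in stats}:
        try:
            pwd.getpwuid(uid)
        except KeyError:
            pass  # uid has no passwd entry
    for gid in {s.st_gid for s in stats}:
        try:
            grp.getgrgid(gid)
        except KeyError:
            pass  # gid has no group entry
    timings["name_lookup"] = time.monotonic() - start

    return timings


if __name__ == "__main__":
    for step, seconds in time_ls_steps(".").items():
        print(f"{step:12s} {seconds:.3f}s")
```

If run during a hang, a large "lstat" time would point toward inode reads, while a large "name_lookup" time would point toward user/group name resolution rather than the disks.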

My first guess was that this had something to do with my SCSI Adaptec
2940U2W controller or my LVD disks (Seagate Cheetahs). However, I've
kicked away at that idea for three months, trying many solutions. It may
have something to do with running software RAID-1, but it hangs the same
when the system is in degraded mode with only one SCSI disk operational.

According to the developer/maintainer of the RAID-1 code, Mingo, it is
highly unlikely to be a problem specifically with the RAID code. And if
I can 'cat' a file during the freeze, then it isn't a disk system freeze
(how could I read from disk?).

I suspect it is something to do with taking a Debian Potato system and
putting a vanilla kernel on it patched with the aic7xxx driver and the
RAID code. Many people are running these patched vanilla kernels on
Red Hat systems, so it seems this is something specific to Debian.

My system is:

ASUS P3B-F motherboard (Intel 440BX AGPset);
  CPU Bus/PCI Freq - 100.3/33.43
Single PII 400MHz (Deschutes)
512MB 100MHz ECC RAM
Adaptec 2940U2W (2.20.0 BIOS) -- Tagged Command Queueing enabled; max 32
  commands per device; reset delay 5 sec.
SCSI cable is 3' internal, teflon, custom-made Ultra2-LVD with active
  negation terminator
2 x Seagate ST39103LW (Ultra2-LVD drives)
NE2000 Clone (ISA)

I'd appreciate any suggestions. So far, the main things I've tried are:

 * testing kernels 2.2.10 to 2.2.16 with appropriate RAID
   patches (no differences);

 * replacing original Adaptec cable and terminator with custom built,
   high-end, teflon cable and terminator ($125US);

 * set up remote logging to catch any error message (none logged);

 * added verbose debugging (nothing);

 * upgraded the AIC7xxx driver from 5.1.21 to 5.1.30 (performance
   improved but it still hangs);

 * removed the IDE drives and kernel support for IDE, based on
   someone's hunch;

 * lowered the front-side bus speed from 100.3MHz to 88.3MHz
   (underclocking the CPU at the same time), based on another hunch;

 * checked IRQs for conflicts -- none;

 * compiled the kernel with and without tagged command queuing,
   with more and fewer max commands;

 * added hard disk fans;

 * reduced the SCSI transfer rate for the Seagate drives from
   80MB/s to 40MB/s.
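
A further check that is not in the list above would be to see which processes are stuck in uninterruptible sleep (state "D" in /proc/<pid>/stat, usually meaning a wait on I/O) while the hang is in progress. The following is a minimal sketch under stated assumptions: /proc is mounted, Python is available, and the script itself can still run during the hang, which is not guaranteed. The function names parse_stat_line and blocked_processes are hypothetical.

```python
import os


def parse_stat_line(line):
    """Return (pid, command, state) from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the last ')' in the line.
    head, _, tail = line.rpartition(")")
    pid_text, _, command = head.partition("(")
    state = tail.split()[0]
    return int(pid_text), command, state


def blocked_processes(proc_root="/proc"):
    """List (pid, command) for processes in uninterruptible sleep ('D')."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, command, state = parse_stat_line(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited meanwhile, or the line was unparsable
        if state == "D":
            blocked.append((pid, command))
    return blocked


if __name__ == "__main__":
    for pid, command in blocked_processes():
        print(pid, command)
```

Running this in a loop from a shell that stays responsive (for example over the NFS-connected machine is not possible, since /proc is local, so it would have to be started on the server before the hang) could show whether the Apache children and the frozen "ls -l" are all waiting in the same state.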

Again, any suggestions appreciated.

Thanks in advance,

Jeff Hill

-- 
------------------------------------------------------------
------  HR On-Line:  The Network for Workplace Issues ------
http://www.hronline.com - Ph:416-604-7251 - Fax:416-604-4708
------------------------------------------------------------
