I have been fighting a problem with 2.4.5 for a little while. I just
installed 2.4.8 on these boxes and can still get them to lock up within
an hour.
I have multiple Athlon servers that I can get to lock up at will.
These servers have MSI motherboards, 512MB of PC133 RAM, ATA100 20G hard
drives, 1.2GHz T-bird CPUs and D-Link quad fast ethernet cards.
When they lock up I cannot ping the box, the screen is blank (and does not
respond to alt-sysrq), the keyboard numlock does not respond, the reset
button does not work and the power switch needs to be held down for 4 sec
to shut off power. Nothing appears in syslog (no messages at all between
boot and lockup, except for proxy logs in method 1).
I have two ways I can get the box to die.
Method 1: over the network.
I have the plug-gw proxy from the firewall toolkit installed. If I make
enough connections to the proxy rapidly enough, the box will lock up.
I started this test in an attempt to find out why the production box was
slowing to a crawl after being in production for a while (under the same
workload it was shifting from 5% user/10% system to 2.5% user/97.5% system
and staying that way even if the load went away). A system reboot would
clean things up for a day or so. In attempting to duplicate the problem I
have been hammering the box with connections much more rapidly, and the
problem seems to appear faster if I hammer faster.
The plug-gw does log heavily. With syslog configured for sync logging I
max out at ~80 connections/sec; with it set for async logging I get ~300
connections/sec. In my latest test the log file grew to 39MB on an
ext2fs, with ~2100 lines of binary junk appended to the file between the
last intact log message and the boot messages.
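
One way to estimate how many lines of a damaged log contain binary junk is
to count lines holding bytes that are neither printable ASCII nor
whitespace. This is only a rough sketch: the function name
count_junk_lines is made up for illustration, and it assumes a grep that
supports -a (GNU and BSD grep do).

```shell
#!/bin/sh
# Sketch: count lines in a file that contain non-printable bytes.
# count_junk_lines is a hypothetical helper, not part of any toolkit.
count_junk_lines() {
    # LC_ALL=C makes the character classes byte-oriented (ASCII only).
    # -a treats a binary file as text; -c prints the number of matching lines.
    LC_ALL=C grep -a -c '[^[:print:][:space:]]' "$1"
}
```

It would be called with the log file path as its only argument, for
example: count_junk_lines /var/log/messages (the path is just an example).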
Method 2: no network.
I created a simple script:
    while true; do
        date >>junk.lots
    done
I start 20 of these running at the same time and the box will die within
an hour or so. I have seen the junk.lots file be ~70MB at the time of
death, and at the last crash it was 64MB (it contains the output from the
last two crashes, and ~1200 lines of garbage that look like they are part
of a syslog file from Aug 2).
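
For anyone trying to reproduce this, a minimal sketch of starting all the
writers from a single shell follows. The function name start_writers, its
arguments and the per-writer iteration limit are illustrative additions;
the test described above simply ran 20 unbounded copies of the loop.

```shell
#!/bin/sh
# Sketch: start several background loops that append the date to one file.
# start_writers and its parameters are hypothetical names for illustration.
start_writers() {
    count=$1      # number of concurrent writers
    outfile=$2    # shared output file
    limit=$3      # iterations per writer; 0 means loop forever
    i=0
    while [ "$i" -lt "$count" ]; do
        (
            n=0
            while [ "$limit" -eq 0 ] || [ "$n" -lt "$limit" ]; do
                date >>"$outfile"
                n=$((n + 1))
            done
        ) &
        i=$((i + 1))
    done
    wait   # returns once every writer has exited (never, when limit is 0)
}
```

Running "start_writers 20 junk.lots 0" would correspond to the test
described above; a non-zero limit is only there to allow a bounded dry run.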
On both machines I have gone into the BIOS and set 'failsafe defaults'.
This helped lengthen the time between crashes (before I could sometimes
get the machines to crash just by letting them sit idle for several
Attached is the .config used to build the kernel and the lspci -vxx output
from each of the two machines.
Please tell me what I can do to assist in debugging this problem.
David Lang