My Linux Crash / Lockup Data

Keith Rowland (keithr@primenet.com)
Wed, 25 Feb 1998 00:59:49 -0700


Since I did not receive any feedback on my previous post, as promised
here is my history with Linux lockups to add to the 'data collective.'
I'll try to be as brief as possible, but this is a complete report.

The freeze problem I am talking about is the one where everything stops:
no ping, no keyboard response, no log data, no screen messages
(exception noted below). Only a hard reset will restore the system.

I have run several Linux systems; the only ones showing this problem are
the ones running as full-time web servers, located at colocation
facilities of two different ISPs. The two systems I run here in my home
office have never exhibited the problem. I've had this problem on 4
different servers. I'll summarize the system configurations at the end
of this post.

My first 2.0.30 system ran fine for 3 months before starting to lock up.
The first few times, the system actually came back to life an hour
later, restarting by itself. Apart from those first few times, the usual
symptom is a total freeze and we restart it by hand within the hour. A
few times the system stayed locked up for 3 to 8 hours before it was
manually reset, as it never rebooted by itself. The time between lockups
has varied from 20 minutes to 2 weeks. Typically, though, we get 36 to
52 hours before a lockup. It has happened during busy times and during
slow times: 10 PM, 11 PM, 3 AM, 4 AM, 6 AM, 12 noon.
Totally random. No pattern detected.

Now I run a fairly busy web site. We transfer 8 to 10 GB a day. We
have run anywhere from 150 to 200 web clients. We've run for hours maxed
out and no crash, and then at 3 AM when usage is down to almost nil,
it'll freeze. Sometimes we will crash during busy times. No correlation.

I am still trying to determine what I changed on the system after the
first three months, since the same hardware and kernel ran for 3 months
without incident. Then the problem started, with no change in hardware.
We first thought it was failing memory. Well, after replacing everything,
we could rule that out.

I built up 2 new servers to take the place of the one that was
freezing. New CPU, MEMORY, NETWORK CARDS, etc. You get the picture. My
first system had two HDs, one backing up the other, so I used these SAME
two HDs, one for each new system. So it was the same kernel version,
user programs and data; I only recompiled the kernel for the different
network cards. I ran DNS rotation for the web site, so as to split the
load. Well, the first machine locked up within 12 hours, the second about
a week later. So far, Linux 3, Sysadmin 0.

Then I decided to move the web site to a new hosting company that builds
the servers and leases them to you. They built me a server and I moved
the site off the dual servers and onto this new single server. 2 days
into the new site: lockups. They had never seen such a thing. It locked
up 2-3 times a day, then would go 2 days without a lockup. Daytime,
nighttime, anytime. No correlation.

Thankfully I finally resorted to the newsgroups and linux-kernel
archives and saw that I was not alone. While I knew it was not something
I specifically did, I do know it is some particular configuration or
user program that I am running, and that some of you out there are
running, that is causing this. I'm still not sure whether this is a kernel
problem or a user space program problem. But since the most activity
seems to be here, and it is a complete lockup, it favors a kernel
problem, since previous kernels have run flawlessly in the past.
Anyway... back to some more facts. Linux 4, Sysadmin 0.

On my site I run, in addition to Apache, a Real Audio Server, a BBS
CGI-BIN program, limited ftp uploads and normal system daemons like
sendmail.

I've tried Apache 1.2.1 and 1.2.5, Real Server 4.0 and 5.0, Lundeen WebX
2.0.1 and 2.0.2. I even removed the Real Audio Server and the BBS from
the servers and the servers still locked up. It's not RA or BBS. It has
locked up before and after I started using POP mail service on the
server. It's not POP. Could still be Apache or sendmail, if this is a
user space program.

One time I was even rlogin'ed and running 'top' when it crashed; no
unusual process showed up. Machine was only moderately loaded.

As for the kernel, it crashed on 2.0.30 on the first three servers, and
2.0.33 on the latest server. Slackware distribution on the first 3,
RedHat 5.0 on the last, with the kernel updated to 2.0.33. (RedHat 5.0
comes with 2.0.32, I think.)

I have run 2.0.30 (RedHat 4.2) on my two home systems without any
lockups. So I've narrowed the field. Linux 4, Sysadmin 2. Tonight I've
updated my home systems to 2.0.33. One is my internet gateway,
mailserver, samba server, print server, etc. and the other is my personal
X-Win system, which I am using to type this. I'll report any lockups
on these, but I don't expect any.

I am currently rebuilding the first server, based on 2.0.33, and I
know it will fail also, but I will be putting it into service next week. I
am still looking through the log files and the directory and file time
stamps on the first system's hard drive, trying to figure out what
changed on the system back in mid December when this started.
Since I did have a working 2.0.30 system in place, running a busy web
site without locking up, you'd think I could find the problem. But I
haven't found it yet. It may just be coincidental, however, so don't rule
anything out. I've forwarded two config files to the person who is
collecting them.

I will summarize here my systems that lock up:

SYS 1:   P200, 128 MB RAM, 3COM 3C590 Vortex, PCI 430 TX Board, 2.0.30
SYS 2/3: P200, 128 MB RAM, KINGSTON NE2000, PCI 430 VX Board, 2.0.30
SYS 4:   P233, 64 MB RAM, 3COM 3C905 Boomerang, PCI Triton Board, 2.0.33

All Systems had IDE Hard Drives, VGA Video Cards.

Kernel configuration was typical for standalone web site operation. No
PPP, no SLIP, no serial ports, no parallel ports. Had IP Alias ON (4
virtual hosts), all typical good TCP/IP stuff ON, etc. I have no NFS, no
SAMBA, no sound, no X, no tape, no mouse, no extra filesystems or
special devices.

I'll field any questions to make this complete and am willing to try
anything and test anything I possibly can. BTW, the main web site is

http://daytaco.com

which lists all the sites that are running on the server.

I've also tested the system against the teardrop attack (it passed) and
checked the code for the MAX STACK problem in garbage collection. While
my 2.0.30 systems had neither of these, the newer 2.0.33 system still
locks up.

One EXCEPTION to the "no screen messages" rule came last night (as noted
in yesterday's e-mail, but posted here for completeness): a bunch of
kernel messages from the sched.c module were found filling up the screen.
These were:

Aiee: scheduling in interrupt 001256fd
Aiee: scheduling in interrupt 001256fd
Aiee: scheduling in interrupt 001256fd
Aiee: scheduling in interrupt 001256fd

Maybe someone can track this down. But we don't know whether these were on
the screen long before the lockup or not. I will go and disable the
screen blanker, so maybe we can catch more of these messages.
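
For anyone who wants to chase the address in those messages, here is a
minimal sketch in Python of looking up which kernel symbol a hex address
such as 001256fd falls in. It assumes the value printed after "Aiee:
scheduling in interrupt" is a kernel code address, and that a System.map
from the exact same kernel build is available in the usual "address type
name" format. The map entries and symbol names below are made-up
placeholders for illustration, not taken from any real kernel.

```python
import bisect


def parse_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        address_hex, _symbol_type, name = parts
        symbols.append((int(address_hex, 16), name))
    symbols.sort()
    return symbols


def resolve(address, symbols):
    """Return (name, offset) for the last symbol at or below address, else None."""
    addresses = [addr for addr, _ in symbols]
    index = bisect.bisect_right(addresses, address) - 1
    if index < 0:
        return None  # address lies below every known symbol
    base, name = symbols[index]
    return name, address - base


# HYPOTHETICAL map entries, for illustration only. A real lookup would read
# the System.map file that matches the running kernel.
SAMPLE_MAP = [
    "00125000 T example_func_a",
    "00125600 T example_func_b",
    "00125800 T example_func_c",
]

symbols = parse_system_map(SAMPLE_MAP)
print(resolve(int("001256fd", 16), symbols))
```

With a real System.map, the lines would come from reading that file
instead of SAMPLE_MAP. A map from a different kernel build would give
meaningless answers, so the map has to match the kernel that printed the
message.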

Thanks for your patience. I hope we can find the common thread to this
problem. While I am a strong Linux enthusiast, and have defended it
against people who want me to switch to Solaris, this is giving Linux a
bad reputation for a lot of people, and the sooner we can solve this the
better for the whole Linux community.

While I'm not a kernel hacker, I can read and write C code and can make
changes in any programs needed. I'm going to investigate the user
programs, sendmail, wu.ftp, httpd, etc. I'll leave the possible kernel
diagnosis to you fine people on the list.

Sincerely,

Keith Rowland, Webmaster, Sysadm
