2.1.119 "oops"es on a few heavy-loaded servers

Simon Kirby (sim@netnation.com)
Wed, 2 Sep 1998 19:44:28 -0700 (PDT)


Hello,

Being the brave moron I am, I decided to put 2.1.119 on two of our
most-important heavily-loaded production servers at our web hosting
company. One is a statistics compilation server (which uses a lot of NFS
reads and writes), and one is our mail server and DNS server which also
has a lot of logins running pine (its average load average is about 3.50
and it does an average of about 0.5 megabytes of disk I/O per second).

Good news: The mail server ran so much more efficiently that almost every
staff member told me they were amazed at how much faster the machine felt.
The stats server actually went from being completely unusable to feeling
almost unloaded when connecting to the Apache server on it. (Good work,
guys!)

Bad news: Two oopses on the mail server.

The machines were actually up and running fine for a few days...then the
mail server oopsed, and then again the next day. The stats server hasn't
crashed with 2.1.119 yet, but it has with 2.1.117, and it hasn't been
stable since 2.1.113 (which ran for two weeks). All of the oopses killed
the interrupt handler, so it was hard to do anything with the machine
afterwards. The second one on the mail server had a call trace so long
that the EIP traces just scrolled off the screen. It wasn't possible to
scroll back up, because interrupts had been disabled. I wasn't there for
the first one (it happened in the middle of the night), but the guy said
it fit on the screen.

Both machines have exactly the same hardware config, except for the
drives. Both are P2L97-S ASUS boards with onboard 2940UW SCSI, 256MB
SDRAM, 256MB swap, P2 233s. Mail server has 4 UW drives and 1 narrow,
stats server has 1 UW.

Now, I've noticed that with 2.1.117 and earlier kernels, my desktop oopsed
when I was just logging out of all of my console shells before leaving to
go home at the end of the day. It spat out (it seemed) a lot of oopses all
at once and then froze the interrupt handler again. What remained on my
screen was another extremely long call trace. This also happened to a
workmate running 2.1.117, but it didn't freeze the interrupt handler, and
it also happened right after he logged out of one of his consoles. I did
run ksymoops over it and it appeared to be in the function called
"find_buffer" (I don't have the oops any more, though -- sorry). We (the
workmate and I) haven't been able to reproduce it on 2.1.119, though.

I thought about it for a while and came to the conclusion that perhaps it
was related to swap, as both desktops were in the process of swapping back
in while exiting mozilla. Also, I was running a "vmstat 1" from another
machine on the mail server when the kernel panic occurred and there was a
small burst of swap ins just before it happened. And, the stats server
does some swapping when it hits web sites with a _huge_ number of hits, but
it doesn't seem to have died with 2.1.119 yet.
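
For anyone who wants to log swap-ins programmatically rather than watch
"vmstat 1" by eye, here is a rough sketch. It assumes the counters arrive
as text in "name value" lines with a swap-in counter named "pswpin", which
is how /proc/vmstat looks on much later kernels; it is not something 2.1.x
provides, so both the counter name and the format are assumptions here.
parse_counter() is a hypothetical helper, not part of any existing tool.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan text made of "name value\n" lines and return
 * the value for the given counter name, or -1 if the name is not found.
 * The "name value" layout is an assumption (see the note above). */
long parse_counter(const char *text, const char *name)
{
    size_t n = strlen(name);
    const char *line = text;

    while (line != NULL && *line != '\0') {
        /* Match the whole name, followed by the separating space. */
        if (strncmp(line, name, n) == 0 && line[n] == ' ')
            return strtol(line + n + 1, NULL, 10);

        /* Advance to the start of the next line, if there is one. */
        line = strchr(line, '\n');
        if (line != NULL)
            line++;
    }
    return -1;
}
```

Sampling the counter once a second and printing the difference between
consecutive samples would show a burst of swap-ins like the one described
above.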

On another note, I was running "mkraid" (the raid construction tool) on
the mail server the other day and noticed that the constant write() calls
it was doing were forcing the machine to start swapping. If I didn't keep
suspending and resuming the process, it would swap forever. At one point
I disabled swap and let it run because it was getting annoying -- after
about 20 seconds the machine completely stopped responding for about 10
seconds and then resumed. During this time the machine pinged but all
user processes appeared to have frozen. About 60 seconds later it froze
again, this time for about 5 minutes. I was in the process of calling
somebody at the office when it suddenly unfroze again. I re-enabled swap
and the problem went away. It seems it's getting stuck in some sort of
loop looking for available memory -- surprising on a machine with 256MB of
RAM, though. This might have been on the 2.1.117 kernel -- I can't
remember.

I will hit my machine at home hard with mass malloc()s and see if I can
make it die. I'll set my console to something insanely large and see if I
can capture any oopses.
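
A minimal sketch of such a malloc() test, under a few assumptions:
stress_alloc() is a hypothetical helper, the chunk size and chunk count
are whatever the caller picks, and every page is written to because an
allocation that is never touched may not get real memory behind it.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: allocate up to max_chunks blocks of chunk_bytes
 * each and write to every page so the kernel has to back them with real
 * memory. Returns the number of chunks allocated and touched. All memory
 * is freed again before returning. */
size_t stress_alloc(size_t chunk_bytes, size_t max_chunks)
{
    assert(chunk_bytes > 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        page = 4096;                /* assumed fallback page size */

    char **chunks = calloc(max_chunks, sizeof *chunks);
    if (chunks == NULL)
        return 0;

    size_t got = 0;
    while (got < max_chunks) {
        char *p = malloc(chunk_bytes);
        if (p == NULL)
            break;                  /* allocator refused; stop here */

        /* Touch one byte per page within the chunk. */
        for (size_t off = 0; off < chunk_bytes; off += (size_t)page)
            p[off] = (char)got;

        chunks[got++] = p;
    }

    for (size_t i = 0; i < got; i++)
        free(chunks[i]);
    free(chunks);
    return got;
}
```

Calling it in a loop with a total size somewhat larger than physical RAM
(for example 1MB chunks and a few hundred of them on a 256MB machine)
should force swapping; on a box that might oops, run it only where a crash
is acceptable.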

Weee,

Simon-

| Simon Kirby | Systems Administration |
| mailto:sim@netnation.com | NetNation Communications |
| http://www.netnation.com/ | Tech: (604) 684-6892 |
