Re: A true story of a crash.

Ian and Iris (brooke@mail.jump.net)
Fri, 14 Aug 1998 14:45:57 -0500


McGee, Chris wrote:

> > A true story:
> >
> > The time is 12:05 pm CST. The date is now. You are merrily using your
> > personal Linux 2.1.115 system, testing Communicator 4.5 PR1, when all of
> > a sudden and out of the blue, the hard drive starts cranking ever harder.
> > Xload scales down a few times as the load average goes ballistic. Quickly
> > the machine grinds to a halt. The mouse won't move - you can't even
> > change virtual consoles. Still the hard drive thrashes. You remember that
> > you compiled the Magic SysRq key in, so in desperation, you try it.
> > Alt-SysRq-K. There. You won't be able to use the console until you reboot
> > (notwithstanding various uncouth dosemu tricks) but at least the system
> > has stopped thrashing.
> >
> Communicator 4.x hard-locks my SMP box periodically. No
> magic SysRq allowed. It doesn't run out of RAM, the machine just DIES. I
> suffered through this all the way from 2.1.108 until recently, when I
> upgraded Communicator.
> Upgrading to Navigator 3.04 fixed my problems, and the
> interface is nicer than 4.x too :) I am really hoping someone sees the
> light and continues working on the 3.x series... 4.x is so big and silly
> I wouldn't want to use it even if it DID work right. I want a good web
> browser, not a halfass client for every possible network service ever.
>
> > The machine must stay up!
> >
> This would be good, yes :)
>
> > Suggestions are welcome.
> >
> Duck! He who essays a memory management idea is he who
> is about to get 300,000 responses :) The fact that we're in a code
> freeze may save you from the worst of it tho :)
>
> I think the idea of killing big processes in times of
> low resources won't win many friends, though, especially with all the
> databases being ported to Linux.
>
> --Chris
>
>

Just imagine Ingres mallocing 64 megs on your 32 meg machine with 16 megs of
swap. Tell me: is it more correct to kill the process and report why, or simply
to lock up?

According to an IBMer I know, when faced with extremely low swap, AIX will first
send SIGDANGER to all processes. This is ignored by default, but a polite
process may try to free up some memory. If that fails, it will start killing
RANDOM processes, in hopes that it kills the worst offender. I imagine it gives
SOME diagnostic message.
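
A minimal sketch of what a "polite" process might do with such a warning.
Linux has no equivalent low-memory signal, so SIGUSR1 stands in for it here,
and the cache and handler names are made up for illustration:

```python
import os
import signal

# Hypothetical in-process cache holding data that can be rebuilt on demand.
cache = {n: bytes(1024) for n in range(1000)}

# Stand-in for the low-swap warning described above. This is an assumption:
# SIGUSR1 is used only because Linux offers no such warning signal.
LOW_MEMORY_SIGNAL = signal.SIGUSR1


def on_low_memory(signum, frame):
    """Give memory back by dropping everything that can be recomputed."""
    cache.clear()


# Note the difference from the warning described above: an unhandled SIGUSR1
# terminates the process instead of being ignored, so the handler must be
# installed before anyone sends the signal.
signal.signal(LOW_MEMORY_SIGNAL, on_low_memory)
```

Sending the warning by hand, with os.kill(os.getpid(), LOW_MEMORY_SIGNAL),
should leave the cache empty while the process carries on running.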

Essentially there are two primary means of "attack" against a system's MM code:
1. Monolithic Huge Process
2. Many Small Forking Processes

In either case, the system needs a way of deciding which process(es) to kill.
For the first case, it's easy. For the second case, we have a determined and
conscious attacker. Either way, the strategy I outlined would correctly point
the finger, because a quick analysis of the logs would tell whodunit.

Further, I would run my database from init in a runlevel, so that it gets
restarted if it is killed. Since it's TP (transaction processing), it can safely
evaporate without data corruption, and upon restart it would know where it was.

Is there some ulimit magic I can use to keep the system from going crazy? I
know I can provoke the craziness with a make -j in the mozilla source tree, and
also when netscrape decides to get too fat.
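
For what it's worth, here is a rough sketch of the kind of per-process limits
I have in mind, using the setrlimit() interface through Python's resource
module. The helper names and the default numbers are assumptions picked for
illustration, not recommendations:

```python
import resource
import subprocess

MiB = 1024 * 1024


def clamp(requested, hard):
    """Return a soft limit that never exceeds the existing hard limit."""
    if hard == resource.RLIM_INFINITY:
        return requested
    return min(requested, hard)


def limit_child(max_address_space, max_processes):
    """Build a preexec_fn that lowers the soft limits in the child only."""
    def apply_limits():
        wanted_limits = (
            (resource.RLIMIT_AS, max_address_space),   # virtual memory, bytes
            (resource.RLIMIT_NPROC, max_processes),    # processes per user
        )
        for which, wanted in wanted_limits:
            _soft, hard = resource.getrlimit(which)
            resource.setrlimit(which, (clamp(wanted, hard), hard))
    return apply_limits


def run_limited(argv, max_address_space=256 * MiB, max_processes=64):
    """Run argv with a capped address space and process count."""
    limits = limit_child(max_address_space, max_processes)
    return subprocess.run(argv, preexec_fn=limits)
```

Something like run_limited(["make", "-j"]) would then run the build under
those caps. Two caveats: RLIMIT_NPROC counts every process owned by the user,
not just the children of this command, and the limits apply to each process
separately, so many medium-sized processes can still add up. In bash the
corresponding knobs are ulimit -v and ulimit -u.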

BTW, if _ANY_ application can hard-lock your machine, it is a kernel bug or a
hardware problem. That's the philosophy I've always thought we held.

Hey, I've an idea! Kill process GROUPS which are biggest AND which are connected
to TTYs first. Thus, if you try to kill the system, you will be logged out.

For the case of process groups w/o ttys, kill the process GROUP responsible for
the most recent memory allocation.

This will ALMOST ALWAYS be the RIGHT THING.

(I think.)
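
To make the rule concrete, here is a small user-space sketch of one reading of
it: groups attached to a TTY go first, biggest first, and only when there are
none does the most recent allocator get picked. The Proc record and its
fields, including the last_alloc tick, are hypothetical. This is not kernel
code:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proc:
    pid: int
    pgid: int             # process group id
    tty: Optional[str]    # controlling terminal, or None for daemons
    rss_kb: int           # memory in use, in kilobytes
    last_alloc: int       # hypothetical tick of the most recent allocation


def pick_victim_group(procs):
    """Return the process group id to kill, or None if there is nothing."""
    size = defaultdict(int)       # total memory per group
    recent = defaultdict(int)     # newest allocation tick per group
    has_tty = defaultdict(bool)   # does any member have a terminal?
    for p in procs:
        size[p.pgid] += p.rss_kb
        recent[p.pgid] = max(recent[p.pgid], p.last_alloc)
        has_tty[p.pgid] = has_tty[p.pgid] or p.tty is not None

    # Groups connected to a TTY are considered first, biggest first.
    with_tty = [g for g in size if has_tty[g]]
    if with_tty:
        return max(with_tty, key=lambda g: size[g])
    # Otherwise blame the group that allocated memory most recently.
    if size:
        return max(size, key=lambda g: recent[g])
    return None
```

Summing per group is what catches the many-small-forking-processes case:
fifty 500 KB children in one group outweigh a single 8 MB editor in another,
even though no individual child looks big.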

Ian

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.altern.org/andrebalsa/doc/lkml-faq.html