It seems we have hit the same problem ...
My SMP box (2.0.29) ran for a long time,
but after replacing the 2x133MHz CPUs with 2x200MHz
the same kernel crashed ...
It still responded to ping and telnet, but no
connection was ever established ...
And all this happened while a backup via NFS was running!!!!
Switching to 2.0.31 was terrible, since it locked up after 5
hours of running ...
BTW: the machine is a heavily loaded web server
(loadavg 2 to 20, 200 or more processes in the list)
but with NO special hardware inside (EIDE on a Tyan board),
and no SMP kernel of the 2.0.xx series will run without a lockup
:-(
I hope a kernel >2.1.64 will help
CIAO SVEN
---------
> Again... I believe that the software watchdog is designed to simulate
> a real process, under the assumption that if a normal process can't
> start fast enough, the machine must be deadlocked and should be
> rebooted. It doesn't do well when swapping is so excessive that it
> can't start. I have _no_ problems with this, in theory.
>
> The problem is that users don't understand what's going on and try
> their damaging actions over and over. Yes, this is where process
> limits come in. I should. I haven't. It's irrelevant. I haven't had to
> reboot the machine in over a month now (although now that we have snow
> the power failures will no doubt start :-( ).
>
> The web server OTOH died with a DEADLOCK and didn't reboot. Yes, I'm
> leaving myself open to vulnerabilities by not having a hardware
> watchdog, but I can live with that. 99.44% of the time the kernel
> knows when bad things are happening. Most of those are panics. I
> myself haven't had a bad freeze in a long time (good hardware is a
> good thing). However, this DEADLOCK is a problem, esp on SMP machines.
>
> Back to the original problem, the machine was grinding itself into the
> ground this morning. There were a LOT of validating probes on the
> screen (and scrolling). There were 2 device errors, one for each IDE
> hard drive (4.3G Caviar). There was one device not ready error 03:03.
> The machine would respond to pings and tried to open connections,
> that's about it. Again, it was in the middle of another tape backup
> from another machine via NFS. The situation seemed very much like out
> of memory, but I couldn't prove it and didn't have the time to deal
> with it, so I pulled the plug. Nothing in the logs.
>
> Is there something about NFS that I should know about? Our other
> machine (PPro 180 clocked to 200) doesn't seem to have a problem with
> the backups, but it's not serving 100+ web pages at the same time. A
> conflict with cookies maybe?
> -Rob H.
>
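
For anyone following along, the software watchdog idea described in the quoted message can be sketched roughly like this: an ordinary user process keeps writing to the watchdog device, and if it cannot get scheduled in time the driver assumes the box is stuck and reboots it. This is only a sketch under assumptions. /dev/watchdog is the conventional Linux device node, but the function name, the `beats` and `sleep` parameters, and the 10 second interval are made up for illustration, and what happens when the device is closed depends on the driver and kernel configuration.

```python
import os
import time


def pet_watchdog(path="/dev/watchdog", interval=10.0, beats=None, sleep=time.sleep):
    """Write one heartbeat byte to `path` every `interval` seconds.

    `beats` limits the number of heartbeats (None means loop forever).
    `sleep` is injectable so the loop can be exercised without waiting.
    Both are hypothetical conveniences, not part of any watchdog API.
    Returns the number of heartbeats written.
    """
    fd = os.open(path, os.O_WRONLY)
    sent = 0
    try:
        while beats is None or sent < beats:
            # Any write resets the driver's countdown. If this process is
            # starved (deadlock, heavy swapping), the write comes too late
            # and the driver reboots the machine.
            os.write(fd, b"\0")
            sent += 1
            sleep(interval)
    finally:
        # Whether closing disarms the watchdog or leaves it running
        # depends on the driver and kernel configuration.
        os.close(fd)
    return sent
```

Run as root with something like `pet_watchdog(interval=10)`, choosing an interval well below the driver's timeout. Note that this only helps when the kernel can still run its timer; a hard deadlock like the one described above needs a hardware watchdog.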