Re: SMP Instability

James Mastros (root@jennifer-unix.dyn.ml.org)
Fri, 7 Nov 1997 11:42:34 -0500 (EST)


On Wed, 5 Nov 1997, Martin Imrisek wrote:
> I might as well continue it here. I've recently begun to use the SMP
> kernels 2.0.31 and 2.1.62 on my Tyan Tomcat IV (dual P75). I've been
> experiencing some rampant instability from spontaneous reboots, deadlocks
> and halts with garbage on the screen. I've eliminated the reboots, which
> appeared to be caused by hardware misconfiguration.
>
> Currently I am running 2.1.62 without any patches and I am experiencing
> the following:
>
> Nov 5 22:04:15 orpheus kernel: d_alloc: 3650 unused, pruning dcache
> Nov 5 22:04:15 orpheus kernel: d_alloc: 3650 unused, pruning dcache
> Nov 5 22:04:15 orpheus kernel: d_alloc: 3649 unused, pruning dcache
>
> these messages keep on appearing, though they only appear when compiling
> glibc.
These are just debugging messages that aren't given a priority. Ignore
them. You only get them in that context because compiling glibc means
dealing with large numbers of files.

> During this time the compile slows to a crawl, eventually freezing
> the machine. Kernel compiles work flawlessly?!
Now that /is/ a bug...

> Running several high CPU usage programs seems to result in a deadlock.
> More often than not, running an OpenGL xlock results in a frozen machine
> after some time. (like overnight).
Try running top alongside a high-CPU program (try an RC5 cracker from
www.distributed.net -- more useful, and it will burn more CPU (all of your
idle time)). That way, you can see just when you are hanging. Also, keep out
of X. See if you can do Shift-ScrollLock / Control-ScrollLock / SysRq-various
/ change VTs. If you can, then userspace is frozen (e.g. your X server),
not the kernel.
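
If you don't have a cracker handy, here is a minimal sketch of the same idea
in Python (my choice of language; the names burn, heartbeat and run_load and
the log file name are made up for illustration). It keeps every CPU busy in
its own process while the parent appends a timestamp to a file and forces it
to disk, so after a hang the last line of the file tells you roughly when the
machine stopped.

```python
import multiprocessing
import os
import time


def burn(seconds):
    """Spin in a tight arithmetic loop for roughly `seconds` seconds."""
    deadline = time.time() + seconds
    n = 0
    while time.time() < deadline:
        n = (n * 31 + 7) % 1000003
    return n


def heartbeat(path, beats, interval=1.0):
    """Append `beats` timestamp lines to `path`, syncing each one to disk."""
    with open(path, "a") as log:
        for _ in range(beats):
            log.write("%.0f alive\n" % time.time())
            log.flush()
            os.fsync(log.fileno())  # make sure the line survives a hang
            time.sleep(interval)


def run_load(duration, path):
    """Burn all CPUs for `duration` seconds while logging one beat per second."""
    workers = [multiprocessing.Process(target=burn, args=(duration,))
               for _ in range(multiprocessing.cpu_count())]
    for w in workers:
        w.start()
    heartbeat(path, int(duration))
    for w in workers:
        w.join()
```

Something like run_load(600, "heartbeat.log"), started from a text console
rather than from X, should be enough to exercise both processors; the
duration and file name are only examples.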

> Deadlocks also seem to happen when running several gimp filters on large
> images, though this does not happen often.
Likely from high CPU usage.

> Or, compiling the 2.1.62 kernel
> with 'make -j'.
Again, high CPU. Possibly high CPU and memory?

> On several occasions I've ended up with garbage
> >000:0000 repeated all over the screen. This seems to be a phenomenon
> with the 2.1.xx kernels only. I've never seen it under 2.0.31.
Sounds like a hw problem to me...

>
> Also, recently I've been getting these kinds of messages once in a while:
>
> Nov 5 21:58:35 orpheus kernel: Unable to handle kernel NULL pointer
> dereference at virtual address 00000199
These are almost definitely a kernel bug. Run them through ksymoops, and
mail the list with the results.
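
To collect the messages in the first place, a rough Python sketch like the
one below can pull the kernel lines out of the syslog and strip the
timestamp and host prefix. It assumes the line layout shown in the quoted
messages ("Nov 5 21:58:35 orpheus kernel: ..."); the function name and the
regular expression are mine, not part of any tool.

```python
import re

# Matches a prefix such as "Nov 5 21:58:35 orpheus kernel: "
PREFIX = re.compile(r"^\w{3}\s+\d+\s+\d\d:\d\d:\d\d\s+\S+\s+kernel:\s?")


def kernel_messages(lines):
    """Return the text of every kernel line, without the syslog prefix."""
    out = []
    for line in lines:
        m = PREFIX.match(line)
        if m:
            out.append(line[m.end():].rstrip("\n"))
    return out
```

Feeding it the lines of your syslog file gives you just the kernel output,
which is easier to save to a file and include in a report.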

> This has also happened when doing a 'cat /dev/fd0 |less' to look at some
> raw data. Under 2.0.3x this would at worst mess up my terminal.
Yep, messing up your terminal is what this should do.

> Sound on 2.0.31 either works, or causes spontaneous deadlocks (MAD 16
> card).
Sounds like a hw problem.

> On 2.1.62 sound works but has periodic 'clicking' static like sounds when
> playing CDs.
Sounds like a bad CD - playing a CD is completely between the CD drive and
the sound card. All the kernel does is tell the drive to start.

> --------------------------------------------------------------
> Martin Imrisek "I've done . . . questionable things.
> imrisek@interlog.com Nothing the God of biomechanics
> wouldn't let you into heaven for."

-=- James Mastros
---
When the annals of distributed computing are written, and the name 'Bovine'
appears in there, I can say "Hey, I was a part of that, I checked .0012% of
the keyspace".
-=- Brian Wilson <wilsonb@mindspring.net>
Go to www.distributed.net before I come make you!