RESEND: Consistent lock up 2.6.8-1.521 (and 2.6.8.1 w/ high-res-timers/skas/sysemu)

From: Andrew A.
Date: Fri Oct 29 2004 - 10:43:40 EST


This message is being resent because it was never forwarded by majordomo, it is superceded by "RE: Consistent lock up 2.6.10-rc1-bk7
(mutex/SCHED_RR bug)" sent minutes ago, which *was* forwarded by majordomo. I have shortened the message considerably by including
links for, rather than quoting the sysrq-t output (didn't realize the original was 552k!)

-----
Caveat: This may be an infinite loop in a SCHED_RR process. See very bottom of email for sysrq-t sysrq-p output.



I have in the past posted emails with the subject "Consistent kernel hang during heavy TCP connection handling load" I then
recently saw a linux-kernel thread "PROBLEM: Consistent lock up on >=2.6.8" that seemed to be related to the problem I am
experiencing, but I am not sure of it (thus the cc:).

I have, however, now managed to formulate a series of steps which reproduce my particular lock up consistently. When I trigger the
hang, all virtual consoles become unresponsive, and the application cannot be signaled from the keyboard. Sysrq seems to work.

The application in question is called "tt1". It runs several threads in SCHED_RR and uses select(), sleep() and/or nanosleep()
extensively. I suspect there's a good chance the application calls select() with nfds=0 at some point.

Due to the SCHED_RR usage in tt1, before executing the tt1 hang, I have tried to log into a virtual console on the host and run
"nice -20 bash" as root. THe nice'd shell is hung just like everything else.

Did I do it right? I was trying to make sure this hang is not simply an infinite loop in a SCHED_RR high priority process (tt1).

I initially had a lot of trouble trying to capture sysrq output, but then I checked my netlog host and found (lo and behold) that it
had captured it! Of course, that was before I went through the trouble of taking pictures of my monitor! I've included the netlog
sysrq output from two runs below. They are at the very bottom of this email, separated by lines of '*'s These runs are probably
DIFFERENT than the runs from which I produced the below screenshots.

So, here are those screenshots, I still welcome any comments you might have about easier ways to capture sysrq output than using
netdump!

I modified /etc/syslog.conf to say kern.* /var/log/kernel, however, output of sysrq-t and sysrq-p while in the locked up state never
ends up in the file (though, it does, when not locked up).


sysrq-t --> http://www.memeplex.com/lock1.gif, http://www.memeplex.com/lock2.gif

sysrq-p --> http://www.memeplex.com/lock3.gif

Kernel symbols for above are at http://www.memeplex.com/System.map-2.6.8-1.521.gz

sysrq-t --> http://www.memeplex.com/lock4.gif

sysrq-p --> http://www.memeplex.com/lock5.gif

Kernel symbols for above are at http://www.memeplex.com/System.map-2.6.8-1.521.gz and

A.

***********************************************************

http://www.memeplex.com/sysrq1.txt.gz


***************************************************************************************

http://www.memeplex.com/sysrq2.txt.gz


-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/