RESEND: Consistent lock up 2.6.8-1.521 (and 2.6.8.1 w/ high-res-timers/skas/sysemu)

From: Andrew A.
Date: Fri Oct 29 2004 - 11:14:01 EST

Next message: Andrew A.: "RE: Consistent lock up 2.6.10-rc1-bk7 (mutex/SCHED_RR bug?)"
Previous message: Andrew A.: "RESEND: Consistent lock up 2.6.8-1.521 (and 2.6.8.1 w/ high-res-timers/skas/sysemu)"
In reply to: Andrew A.: "RESEND: Consistent lock up 2.6.8-1.521 (and 2.6.8.1 w/ high-res-timers/skas/sysemu)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Try #3. This message is being resent because majordomo never forwarded it. It is superseded by "RE: Consistent lock up
2.6.10-rc1-bk7 (mutex/SCHED_RR bug)", sent minutes ago, which *was* forwarded by majordomo. I have shortened the message
considerably by linking to the sysrq-t output rather than quoting it (I didn't realize the original was 552k!), and have also
mangled the links because I suspect majordomo spam filtering.

-----
Caveat: This may be an infinite loop in a SCHED_RR process. See very bottom of email for sysrq-t sysrq-p output.

I have in the past posted emails with the subject "Consistent kernel hang during heavy TCP connection handling load". I then
recently saw a linux-kernel thread, "PROBLEM: Consistent lock up on >=2.6.8", that seemed to be related to the problem I am
experiencing, but I am not sure of it (thus the cc:).

I have, however, now managed to formulate a series of steps which reproduce my particular lock up consistently. When I trigger the
hang, all virtual consoles become unresponsive, and the application cannot be signaled from the keyboard. Sysrq seems to work.

The application in question is called "tt1". It runs several threads in SCHED_RR and uses select(), sleep() and/or nanosleep()
extensively. I suspect there's a good chance the application calls select() with nfds=0 at some point.
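
For readers unfamiliar with the idiom: calling select() with no descriptors and only a timeout is a common way to sleep with
sub-second resolution. A minimal Python sketch of that pattern follows. The helper name select_sleep is hypothetical, and
nothing here is taken from tt1's source, which is not shown in this thread.

```python
import select
import time


def select_sleep(seconds):
    """Sleep by calling select() with no file descriptors, only a timeout.

    With three empty sets there is nothing to watch, so the call simply
    blocks until the timeout expires. This works on Unix-like systems;
    Windows rejects three empty sets.
    """
    start = time.monotonic()
    select.select([], [], [], seconds)
    # Return the measured wall-clock time actually spent blocked.
    return time.monotonic() - start


if __name__ == "__main__":
    print("slept for %.3f s" % select_sleep(0.05))
```

Whether tt1 really reaches select() this way is only the suspicion stated above; the sketch just shows what such a call looks like.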

Because tt1 uses SCHED_RR, before triggering the tt1 hang I logged into a virtual console on the host and ran
"nice -20 bash" as root. The nice'd shell hangs just like everything else.

Did I do it right? I was trying to make sure this hang is not simply an infinite loop in a SCHED_RR high priority process (tt1).
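
On the question above, a hedged note: niceness only ranks tasks within the ordinary time-sharing policy (SCHED_OTHER), so a
nice'd shell should not be expected to preempt a runnable SCHED_RR thread, whatever its nice value. A rescue shell generally
needs a real-time policy at a higher priority than the threads it must preempt. Also, with the traditional nice(1) syntax,
"-20" requests an adjustment of +20 (lowest priority), not -20. The Python sketch below reports the calling process's policy
and tries to move it to SCHED_FIFO. The helper names are illustrative, and whether this would have kept a console alive in
this particular hang is an assumption, not something tested here.

```python
import os

# Human-readable names for the policies discussed in this thread.
POLICY_NAMES = {
    os.SCHED_OTHER: "SCHED_OTHER",
    os.SCHED_FIFO: "SCHED_FIFO",
    os.SCHED_RR: "SCHED_RR",
}


def describe_policy(pid=0):
    """Return the scheduling policy name of a process (0 means the caller)."""
    policy = os.sched_getscheduler(pid)
    return POLICY_NAMES.get(policy, "policy %d" % policy)


def try_become_realtime(priority=None):
    """Try to switch the calling process to SCHED_FIFO.

    Defaults to the highest SCHED_FIFO priority the system reports.
    Returns False instead of raising when the caller lacks privilege.
    """
    if priority is None:
        priority = os.sched_get_priority_max(os.SCHED_FIFO)
    try:
        os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
    except PermissionError:
        return False
    return True


if __name__ == "__main__":
    print("before:", describe_policy())
    print("switched:", try_become_realtime())
    print("after:", describe_policy())
```

A shell started from such a process inherits the policy, which is the property a rescue console would rely on.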

I initially had a lot of trouble trying to capture sysrq output, but then I checked my netlog host and found (lo and behold) that it
had captured it! Of course, that was after I had gone to the trouble of taking pictures of my monitor! I've included the netlog
sysrq output from two runs below. They are at the very bottom of this email, separated by lines of '*'s. These runs are probably
DIFFERENT from the runs that produced the screenshots below.

So, here are those screenshots. I still welcome any comments you might have about easier ways to capture sysrq output than using
netdump!

I modified /etc/syslog.conf to say kern.* /var/log/kernel; however, the output of sysrq-t and sysrq-p while in the locked-up state
never ends up in the file (though it does when the system is not locked up).
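
On capturing console output from a machine whose local syslogd no longer gets to run: the kernel's netconsole facility sends
console messages as UDP datagrams, so the receiving host only needs to append whatever arrives on a port to a file. A minimal
Python sketch of such a receiver follows. The function names, the port number 6666, and the log file name are assumptions
chosen for illustration, not settings taken from this setup.

```python
import socket


def open_listener(host="0.0.0.0", port=6666):
    """Bind a UDP socket; port 6666 is an assumed, illustrative choice."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def capture_udp_log(sock, log_path, max_datagrams):
    """Append up to max_datagrams received datagrams to log_path.

    Each datagram is written and flushed immediately so that nothing
    already received is lost if the sender stops mid-stream.
    """
    received = 0
    with open(log_path, "ab") as log:
        while received < max_datagrams:
            data, _sender = sock.recvfrom(65535)
            log.write(data)
            log.flush()
            received += 1
    return received


if __name__ == "__main__":
    listener = open_listener()
    # Hypothetical file name; runs until max_datagrams have been received.
    capture_udp_log(listener, "console.log", max_datagrams=10**9)
```

Because the receiver runs on a different host, it keeps logging even when nothing on the hung machine can be scheduled.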

Due to apparent majordomo spam filtering, these URLs are mangled.

All of these files are at h t t p : / / w w w . m e m e p l e x . c o m /

sysrq-t --> lock1.gif, lock2.gif

sysrq-p --> lock3.gif

Kernel symbols for above are at System.map-2.6.8.1.gz

sysrq-t --> lock4.gif

sysrq-p --> lock5.gif

Kernel symbols for above are at System.map-2.6.8-1.521.gz and

A.

***********************************************************

same url prefix...

sysrq1.txt.gz

sysrq2.txt.gz

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
