Kernel deadlock in 2.2.7 and -9 SMP

Sean Hunter (sean@uncarved.co.uk)
Wed, 26 May 1999 10:09:22 +0100


Yesterday a disastrous upgrade led me to find a kernel deadlock. This
totally killed 2.2.7 and 2.2.9 several times (about eight) on our main
production server before I found out what was causing it. There is
never an oops, and sysrq doesn't work. The machine is just stone dead
with nothing on the screen, no response to keyboard events and no
ping. Syslog on this box should send kernel messages to another
server, but we get nothing in the logs on that box either.

Basically the problem is (I'm pretty sure) large numbers of sockets in
TIME_WAIT. The machine runs an "rpc-like" server process, which is
basically a perl daemon that does remote perl calls over sockets. Due
to the braindamaged way this thing is designed, it opens a new socket
for every call. Another bug caused an error to bounce between the
server and the client box, resulting in huge numbers of sockets in
TIME_WAIT. I'm not sure what they were when it died, but a couple of
times I've seen in excess of 900 sockets in TIME_WAIT before killing
everything. I imagine that when it dies the number is much higher.

I'm going to try to reproduce this here (with a different network
card) today. If people give me some suggestions about what info they
want, hopefully I can have a description of the problem and a short
exploit program up today.

Sean

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/