Re: 2.0.13 Sockets Stuck on close

Malcolm Beattie (mbeattie@sable.ox.ac.uk)
22 Aug 1996 11:34:22 GMT


In article <Pine.LNX.3.91.960822125444.31939B-100000@linux.cs.helsinki.fi>,
Linus Torvalds <torvalds@cs.Helsinki.FI> wrote:
>
>
>On 22 Aug 1996, Thomas Koenig wrote:
>>
>> In linux.dev.kernel, "Eric Schenk" <schenk@cs.toronto.edu> wrote:
>> >Does the stuck socket eventually
>> >disappear, or does it stick around until a reboot?
>>
>> It sticks around until the process dies (in my case, sendmail).
>
>Umm.. This sounds like "TCP_CLOSE_WAIT" as opposed to "TCP_CLOSE". And that
>makes sense, because any socket in TCP_CLOSE shouldn't even show up on
>netstat (because it simply isn't there any more). Maybe "netstat" has a
>bug and reports the wrong state?
>
>And "TCP_CLOSE_WAIT" should NOT time out. Because the state essentially
>means that the other end has closed the connection, and the networking
>side is now waiting for _our_ application to close down the socket. And
>obviously the kernel can't time out on that.
>
>So if you see sockets in CLOSE state (and assuming this really is
>TCP_CLOSE_WAIT), it probably means that there is a local application that for
>some unfathomable reason keeps the socket open. So rather than a kernel
>problem, it might indicate a user-level problem (an application getting stuck
>waiting for something to happen, possibly due to race conditions within the
>application itself due to signal handling or something)?
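
To illustrate the CLOSE_WAIT point quoted above, here is a minimal Python
sketch of what the application has to do once the peer has closed. It is
not taken from sendmail or any real server, and the helper name
drain_and_close is hypothetical. A recv() that returns an empty result
means the other end has sent its FIN; until the application calls close(),
the kernel has to keep the socket in CLOSE_WAIT.

```python
import socket


def drain_and_close(conn: socket.socket) -> int:
    """Read until the peer's FIN arrives, then close our end.

    Returns the number of bytes read. Leaving out the close() call is
    what keeps a TCP socket sitting in CLOSE_WAIT indefinitely.
    """
    total = 0
    try:
        while True:
            chunk = conn.recv(4096)
            if not chunk:  # empty result: the peer has closed its side
                break
            total += len(chunk)
    finally:
        conn.close()  # our half of the shutdown; ends CLOSE_WAIT
    return total
```

An application that gets stuck before reaching the close() call, for
instance while blocked elsewhere or inside a signal handler, would show
exactly the symptom described: the socket lingers until the process dies.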

Our web server was up for three months on a 1.99.4 kernel before a
memory leak thrashed it to death while I took a weekend off a couple
of weeks ago. I thought that some of the TCP problems had been fixed
since then so I didn't think it was worth giving the gory details.
In case this problem is still relevant, I have the output of a
"netstat -not" done some days before it died. The connection states
break down as follows:
3 CLOSE
2 CLOSE_WAIT
532 CLOSING
9 ESTABLISHED
48 FIN_WAIT1
86 FIN_WAIT2
3 LAST_ACK
6 SYN_SENT
34 TIME_WAIT
Apart from that copy of the netstat output, I can't dig around any
more since it's been rebooted, of course. I would guess the kernel
memory leak (which ate up the 64MB memory and 64MB swap) was at least
partially due to network buffers (the send queue figures in that
netstat add up to 7MB). However, it's still running the same kernel
(I might schedule an upgrade to a recent 2.0.x in a couple of weeks)
so, assuming the problems recur, I can poke around a bit on the running
system. The same stats as above for current netstat output show:

# netstat -not | tail -n +3 | awk '{print $6}' | sort | uniq -c
1 CLOSE
2 CLOSE_WAIT
44 CLOSING
9 ESTABLISHED
10 FIN_WAIT1
23 FIN_WAIT2
2 LAST_ACK
5 SYN_SENT
22 TIME_WAIT
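
For anyone who wants to do the same tally on saved netstat output rather
than on a live system, here is a rough Python equivalent of the pipeline
above. It is only a sketch: the function name count_states is made up, and
it assumes the same layout the pipeline assumes (two header lines, state in
the sixth whitespace-separated column).

```python
from collections import Counter


def count_states(netstat_output: str, header_lines: int = 2) -> Counter:
    """Tally the sixth column (connection state) of netstat-style text."""
    counts = Counter()
    # Skip the header lines, as the tail step of the pipeline does.
    for line in netstat_output.splitlines()[header_lines:]:
        fields = line.split()
        if len(fields) >= 6:  # ignore blank or truncated lines
            counts[fields[5]] += 1  # awk's $6 is index 5 here
    return counts
```

Calling count_states(open("netstat.txt").read()) on a saved capture
(the file name is only an example) gives a Counter keyed by state name,
which can be printed with most_common() to get the same kind of table.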

so if it *is* the same problem, maybe it's CLOSING we're talking about
rather than CLOSE or CLOSE_WAIT.
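
To put rough numbers on that, the two tallies above can be compared
directly. The snippet below is just arithmetic on the figures already
posted; the names earlier, current and share are only for illustration.

```python
# State counts copied from the two netstat tallies above.
earlier = {"CLOSE": 3, "CLOSE_WAIT": 2, "CLOSING": 532, "ESTABLISHED": 9,
           "FIN_WAIT1": 48, "FIN_WAIT2": 86, "LAST_ACK": 3, "SYN_SENT": 6,
           "TIME_WAIT": 34}
current = {"CLOSE": 1, "CLOSE_WAIT": 2, "CLOSING": 44, "ESTABLISHED": 9,
           "FIN_WAIT1": 10, "FIN_WAIT2": 23, "LAST_ACK": 2, "SYN_SENT": 5,
           "TIME_WAIT": 22}


def share(counts: dict, state: str) -> float:
    """Fraction of all listed connections that are in the given state."""
    return counts[state] / sum(counts.values())


for name, counts in (("earlier", earlier), ("current", current)):
    total = sum(counts.values())
    print(f"{name}: {counts['CLOSING']} of {total} in CLOSING "
          f"({share(counts, 'CLOSING'):.1%})")
```

That works out to 532 of 723 connections (about 74%) in CLOSING shortly
before the crash and 44 of 118 (about 37%) now, against a handful in
CLOSE and CLOSE_WAIT in both cases.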

--Malcolm

-- 
Malcolm Beattie <mbeattie@sable.ox.ac.uk>
Oxford University Computing Services
"Widget. It's got a widget. A lovely widget. A widget it has got." --Jack Dee