Re: TCP write fails to timeout

Hubert Tonneau (tonneau.heliosam@hol.fr)
Sat, 31 Oct 1998 12:59:28 +0000


Hubert Tonneau wrote:
>
> I use TCP connections between
> - a Linux box (tested with 2.0.35 and 2.1.125)
> - OS/2 boxes.
>
> When the OS/2 box shuts down the connection in the middle,
> while the Linux box was writing,
> the connection will often remain indefinitely in CLOSE_WAIT state.
> No SIGPIPE is received, and the 'write' operation never ends.

Short explanation:

The OS/2 box does not close the TCP sockets when the process
exits, so the TCP connection remains open until the box is
rebooted.

Detailed explanation:

1) In OS/2, socket handles are not file handles (you cannot
use the file API to deal with sockets).
2) They need to be inherited when calling the OS/2 'fork'
replacement function, in order to make porting from Unix to
OS/2 as easy as possible.
3) So the OS/2 team decided to make the socket handles global to
all processes.
The proof is that a process can close the sockets of any
other process, even if it's not a child process.
Ugly, isn't it!
4) Another, even worse, side effect is that if a process exits
without closing a socket, the socket will remain open until
the OS/2 box is rebooted.

So if some of you still use the OS/2 TCP stack, you have to be
sure that your application keeps a list of all open sockets and
registers an exit function that closes all of them.
If you don't, any time you kill a process or the process exits
in the middle due to an error, some sockets may remain open
and consume resources both on the OS/2 box ... and on the box
at the other end of the link.
Of course, the OS/2 socket manual says nothing about this problem.

Now if you cannot check or change the OS/2 application that uses
TCP (source code not available), then you have to reboot the
OS/2 box often, even if it's very reliable, so that the set of
OS/2 boxes doesn't slowly exhaust the TCP resources somewhere in
the network.

In my case, the OS/2 server was exhausting its TCP resources
after a few days, and sometimes after a few hours, and needed
to be rebooted.
In the same situation, a Linux box has exactly the same
problem, even though Linux is doing everything right.
So you've been warned.
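
On the well-behaved end of the link, an application can at least
avoid blocking forever in 'write' by bounding the send itself.
The following is a minimal Python sketch of that idea, not a fix
for the OS/2 behaviour: send_with_timeout is a hypothetical helper
name, and the timeout value is whatever the application considers
reasonable.

```python
import socket


def send_with_timeout(sock, data, timeout_s):
    """Send all of data, giving up after timeout_s seconds.

    Returns True if everything was handed to the kernel, False if
    the send timed out (peer not draining the connection) or the
    connection failed.
    """
    # The timeout applies to blocking socket operations, including sendall().
    sock.settimeout(timeout_s)
    try:
        sock.sendall(data)
        return True
    except OSError:
        # Timeouts are reported as a subclass of OSError, so this also
        # covers resets and broken pipes.
        return False
```

When the helper returns False, the caller would normally close the
socket so that the local resources are released, whatever the peer
does with its own end.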

Thanks to Andi Kleen for his comment about tcpdump that led me
to the solution.

Hubert Tonneau

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/