Re: TCP Stall

Dave Dunston (dunston@netins.net)
Sun, 30 Mar 1997 22:44:40 -0600


Richard B. Johnson wrote:
>
> March 30, 1997
> Via PPP from Groveland, Massachusetts, USA at about 10 baud
> on a 56 kb link.
>
> Gentlemen,
>
> I have been looking into the TCP Stall problem when FTPing
> files between Linux machines via a PPP link. This problem
> also occurs with remotely mounted file-systems. I have
> reported this problem over 10 times during the past two
> years and I have recorded at least 25 other instances in
> which other users have reported the same problem.
>
> I dump all communications on specific problems that I have
> encountered into separate Pine "folders", so it's easy to
> maintain a history of a specific problem.
>
> Apparently this problem is not considered important because
> absolutely nothing has been done about it for over two
> years. There have been no experimental patches from Network
> gurus attempting to fix this very real and very troublesome
> problem.
>
> Instead, I see on the "list" much more important ideas about
> graphical boot-up and other esoterics.
>
> My setup consists of a Linux router, quark.analogic.com,
> which is visible on the Internet via a Cisco interface. This
> router uses Ethernet for its primary communications
> interface. This router establishes a PPP Link to another
> router, skunkworks.analogic.com. This router handles
> Ethernet traffic from my LAN at home to the router at work.
> When everything is working correctly, I can access all my
> work computers and any Internet services, from any of my
> machines at home. All of the machines at home, and the
> router at work are Linux machines. Other machines at work
> are Sun Pizza Boxes, SGI machines, a new Alpha, and several
> old VAXen. I have connectivity to all these machines when
> the PPP Link is running.
>
> The problem is that data being transferred between links
> that run at megabit speeds and links that run at kilobit
> speeds needs flow control.
>
> The RFCs address flow-control using a variable length
> window. RFC-793 addresses the basic window method for flow
> control. Since it was written, there has been extensive work
> on TCP algorithms to optimize data communications. RFC-1122
> addresses the "Silly Window Syndrome".
>
> The Nagle Algorithm, implemented in Linux, works to
> discourage sending tiny segments when the data to be sent
> increases in small increments, while the SWS avoidance
> discourages small segments all the time. It is possible, if
> the implementation is not robust, for the receiver to send
> two or more ACKs per segment received.
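
For readers following along, the Nagle rule described above can be
sketched roughly as follows. This is a minimal illustration, not the
Linux implementation; the function name and parameters are
hypothetical, and receiver-side SWS avoidance is not modeled.

```python
def nagle_may_send(pending_bytes, unacked_bytes, mss):
    """Decide whether the sender may transmit a segment right now.

    pending_bytes: data queued by the application but not yet sent
    unacked_bytes: data already sent but not yet acknowledged
    mss:           maximum segment size in bytes
    (All names are illustrative assumptions.)
    """
    if pending_bytes <= 0:
        return False              # nothing to send
    if pending_bytes >= mss:
        return True               # a full-sized segment is always allowed
    # A small segment is allowed only when nothing is in flight;
    # otherwise the data waits and is coalesced until an ACK arrives.
    return unacked_bytes == 0


print(nagle_may_send(10, 500, 1460))   # small write, data in flight
print(nagle_may_send(10, 0, 1460))     # small write, idle connection
```

The first call holds the data back and the second lets it go, which
is the "discourage tiny segments" behavior the paragraph refers to.
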
>
> Jacobson addresses this problem with the "slow start"
> portion of his algorithm. This algorithm is also provided by
> default in Linux. Failure of either of these algorithms to
> be correctly implemented could cause the problems being
> observed.
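
As a rough picture of what "slow start" does, here is a simplified
sketch of how the congestion window grows over loss-free round
trips. It assumes the window starts at one segment, doubles each
round trip up to a threshold, and then grows by one segment per
round trip. The names and numbers are made up for illustration and
are not taken from the Linux code.

```python
def cwnd_history(mss, ssthresh, round_trips):
    """Return the congestion window (bytes) after each loss-free round trip."""
    cwnd = mss
    history = [cwnd]
    for _ in range(round_trips):
        if cwnd < ssthresh:
            # Slow start: the window roughly doubles per round trip.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: about one more segment per round trip.
            cwnd += mss
        history.append(cwnd)
    return history


print(cwnd_history(mss=1000, ssthresh=8000, round_trips=5))
# → [1000, 2000, 4000, 8000, 9000, 10000]
```
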
>
> Normally, I see the window set at 24,820 (right-hand edge).
> I don't know why. Perhaps someone determined that it was
> optimum. I observe that when the receive buffer gets full on
> the machine that is routing packets to my PPP link, the
> window abruptly goes to zero (0). This is okay; it means "I
> don't have any more room". It could have slowly closed, but
> it doesn't. When the window is zero, the machine attempting
> to send data to the router, stops sending data. This is
> correct. It is not allowed to send data when there is no
> room for it. It CAN send packets; however, it MUST NOT send
> packets containing data.
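
A toy model may help show why the window reaches zero at all when a
fast link feeds a slow one. The sketch below assumes a fixed receive
buffer on the router, a sender that never exceeds the advertised
window, and a constant drain rate into the PPP link. The per-tick
rates are invented for illustration, and the model closes the window
gradually, so it does not reproduce the abrupt closing described
above.

```python
def advertised_windows(buffer_size, arrive_per_tick, drain_per_tick, ticks):
    """Return the window advertised after each tick's data has been accepted."""
    queued = 0
    advertised = []
    for _ in range(ticks):
        free = buffer_size - queued
        accepted = min(arrive_per_tick, free)    # sender respects the window
        queued += accepted
        advertised.append(buffer_size - queued)  # window carried in the ACK
        queued -= min(drain_per_tick, queued)    # slow link drains the buffer
    return advertised


# 24820-byte buffer, fast side offers 10000 bytes per tick,
# slow side drains 700 bytes per tick (made-up rates).
print(advertised_windows(24820, 10000, 700, 4))
# → [14820, 5520, 0, 0]
```

Once the buffer is full, the sender can only learn that space has
opened up from a later ACK or window update, which is where the
stall in the following paragraphs comes in.
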
>
> Now, how does the machine that received a window of zero
> know that buffers are available again? I watch the Sun send
> a SYN. It receives an ACK with the new window. I don't know
> if this is the correct thing to do according to the RFCs,
> but it works. It is likely that the routing machine, i.e.,
> the one that has buffers loaded with data, trying to free
> them by getting the data squeezed into the PPP link, should
> be the machine to send a SYN when buffers are available
> again.
>
> RFC-1122 defines a standard way to "probe" for the new
> window after the window has shrunk to zero. This is shown in
> 4.2.2.17.
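
RFC-1122 4.2.2.17 says the sender should send the first zero-window
probe after the window has been zero for the retransmission timeout
and should back off exponentially between later probes. A minimal
sketch of that schedule is below. The cap on the interval and the
function name are assumptions for illustration, not values from the
RFC or from the Linux source.

```python
def probe_times(rto, max_interval, window_opens_at):
    """Return the times (seconds) at which zero-window probes are sent.

    Probing stops at the first probe sent at or after window_opens_at,
    since that probe's ACK would carry the reopened window.
    """
    if rto <= 0:
        raise ValueError("rto must be positive")
    times = []
    now = 0.0
    interval = rto
    while True:
        now += interval
        times.append(now)
        if now >= window_opens_at:
            return times
        # Exponential backoff, bounded by the assumed cap.
        interval = min(interval * 2, max_interval)


print(probe_times(rto=1.0, max_interval=60.0, window_opens_at=10.0))
# → [1.0, 3.0, 7.0, 15.0]
```

With probing like this, a reopened window would be noticed within
about one probe interval rather than after a stall of many minutes.
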
>
> This does not appear to happen with the Linux machines,
> although it has been confirmed that "tcpdump" will
> randomly drop packets, and often the important ones for
> which you are watching.
>
> The machine will stall for as much as 30 minutes until the
> sender re-sends an unsolicited data packet (Yes, a packet
> with data even though the window was closed). The packet is
> ACKed with the new window and normal data-flow restarts
> until the router's buffer is full again. This continues
> until the file has finally been sent.
>
> The result is that a 1/2 megabyte file will take up to 2
> hours to be sent on a 56 kb link. Sun's "snoop" seems to be
> a lot better at looking for problems than "tcpdump". Tcpdump
> seems to lose a lot of packets. It also fails to interpret
> some of them. When looking for network problems, beware of
> tcpdump. It is not a very good tool. Perhaps if its captured
> binary data were first written to a file, it would not lose
> so much information.
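
To put the numbers above in perspective, here is a quick
back-of-the-envelope calculation. It assumes "1/2 megabyte" means
512 * 1024 bytes, that the link carries a full 56,000 bits per
second, and it ignores PPP and TCP/IP header overhead.

```python
def ideal_seconds(size_bytes, link_bits_per_second):
    """Transfer time if the link were kept busy the whole time."""
    return size_bytes * 8 / link_bits_per_second


size = 512 * 1024                  # assumed meaning of "1/2 megabyte"
ideal = ideal_seconds(size, 56_000)
observed = 2 * 60 * 60             # "up to 2 hours", in seconds
slowdown = observed / ideal

print(round(ideal, 1))             # → 74.9 (seconds)
print(round(slowdown))             # → 96
```

So the transfer should take a little over a minute, and the stalls
make it roughly a hundred times slower than the link allows.
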
>
> Will someone please look into this?
>
> Cheers,
> Dick Johnson
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
> Richard B. Johnson
> Project Engineer
> Analogic Corporation
> Voice : (508) 977-3000 ext. 3754
> Fax : (508) 532-6097
> Modem : (508) 977-6870
> Ftp : ftp@boneserver.analogic.com
> Email : rjohnson@analogic.com, johnson@analogic.com
> Penguin : Linux version 2.1.30 on an i586 machine (66.15 BogoMips).
> Warning : I read unsolicited mail for $350.00 per hour. Supply billing address.
> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Ditto; I have that exact problem too. I have a PPP connection which
will occasionally stall for long periods of time and then eventually
stop trying to get the file. I hope this gets fixed sometime soon.

-Shaun Dunston
dunston@netins.net