Re: PPP slow between Linux and Solaris - why is it so?

Jason Merrill (jason@cygnus.com)
Sat, 13 Jul 1996 21:22:19 -0700


>>>>> Dave Cole <dave@edipost.auspost.com.au> writes:

> I remember someone mentioned that Solaris was retransmitting packets too
> quickly or something causing >50% of bandwidth to be used up by duplicate
> packets. I got the impression that Solaris was to blame. Is this
> correct? In any case, I do not remember anyone ever mentioning a fix /
> trick / work around to get the throughput back up to what it should be.

-] Subject: Announcing New TCP Performance Patch
-] Date: 7 Jun 1996 23:36:21 GMT
-] From: cathe@Eng.Sun.COM (Cathe A. Ray)
-] Organization: Sun Microsystems, Inc.
-] Newsgroups: comp.unix.solaris
-]
-]
-]Sun doesn't ordinarily announce patches when they're released. But
-]we've just finished a series of TCP-related fixes and improvements, and
-]we want to make sure that the news gets out as quickly as possible to
-]the many people who can benefit from our work.
-]
-]This patch announcement will be of interest mostly to folks who use Sun
-]workstations over "slow" links, like most dial-up lines. Please note,
-]though, that you might benefit from the work we'll discuss here even if
-]you've never used one of our workstations directly. (Many companies
-]who provide Internet access use Suns as part of the communication path.
-]And the patches are for Suns running Solaris 2.4 and up.)
-]
-]Also note: This message is coming to you directly from the engineers
-]who did the work. We wanted to get the information out to you right
-]away, but we really aren't trying to replace all the other Sun sources
-]of information you might have access to. Please, don't send us lots of
-]detailed questions--we're not volunteering to answer them (or even
-]respond to many of the followups here). We just really wanted to make
-]sure this message got out. Thanks.
-]
-]Cathe A. Ray
-]Manager, Internet Engineering
-]
-]
-] TCP Performance Improvements For Slow Network Links
-] ===================================================
-]
-]Our Sun team is responsible for basic network communications software.
-]We've been putting in a lot of work lately on improving the performance
-]of TCP over slow network links. Now we're finished; testing is
-]complete; and the patches (for Solaris 2.4 and later) will be available
-]shortly.
-]
-]We undertook the work in response to feedback from customers serving
-]WWW users over asynchronous PPP links. Users of LANs and WANs built on
-]10base-T and faster media never saw the problem behavior, which
-]actually affected FTP and other TCP-based applications as well.
-]
-]With the new patches in place, slow links will operate with roughly the
-]same efficiency as fast links. Without the patches, efficiency of very
-]slow links could, under Solaris 2.5, sink to as low as 5 per cent of the
-]theoretical maximum.
-]
-]In the following sections we will describe in detail what was wrong and
-]how we fixed it. If you don't need to know all that, just check the
-]table below for the patch numbers. They'll be available soon from our
-]usual patch sources. We're confident that customers who have seen the
-]problem will now observe a remarkable improvement. Others will see no
-]change.
-]
-] SPARC:
-]
-]    |    2.4    |    2.5    |   2.5.1   | module affected |
-]    |-----------|-----------|-----------|-----------------|
-]    | 101945-xx | 103169-05 | 103582-01 | /kernel/drv/ip  |
-]    | 101945-xx | 103447-03 | 103630-01 | /kernel/drv/tcp |
-]    |-----------|-----------|-----------|-----------------|
-]
-] X86:
-]
-]    |    2.4    |    2.5    |   2.5.1   | module affected |
-]    |-----------|-----------|-----------|-----------------|
-]    | 101946-xx | 103170-05 | 103581-01 | /kernel/drv/ip  |
-]    | 101946-xx | 103448-03 | 103631-01 | /kernel/drv/tcp |
-]    |-----------|-----------|-----------|-----------------|
-]
-] PowerPC:
-]
-]    |   2.5.1   | module affected |
-]    |-----------|-----------------|
-]    | 103583-01 | /kernel/drv/ip  |
-]    | 103632-01 | /kernel/drv/tcp |
-]    |-----------|-----------------|
-]
-]    Note: Where a revision number has been indicated, ask for the
-]    patch at that revision or later. The 2.4 patch revision number
-]    was not available at the time of this posting. Always try to get
-]    "the latest version" of any patch you go after.
-]
-]
-]HISTORY
-]
-]Strangely, the decline in throughput was the result of several
-]improvements we made over the years to the TCP retransmission
-]algorithms and parameters. Every change improved performance for
-]systems with fast links. The cumulative effect for slow links was just
-]the reverse; but almost all our systems--and our customers'--were
-]hooked up to fast links, and the drawbacks went largely unnoticed. That
-]was the state of affairs at the time 2.4 was released.
-]
-]By the time 2.5 came out, async hookups to the Web had exploded. We had
-]implemented another relatively minor TCP bug fix. Customers with fast
-]links were better off. The efficiency of slow links declined. We
-]quickly learned we had a problem.
-]
-]We tracked down the inconsistencies and rewrote the code. We've
-]redesigned the algorithm for good behavior across all supported
-]configurations. We've added slow links and a wide mix of simulated
-]platforms to our test beds, and tested the fixes in both high-speed and
-]slow-speed networks. The problem is resolved.
-]
-]Excellence is a moving target.
-]
-]
-]TECHNICAL DETAILS
-]
-]Here are some technical details. As you'll see, we've made it a pretty
-]frank discussion. (Please be aware, though, that we do not intend to
-]spend much time debating our decisions here.)
-]
-]The throughput troubles on slow lines result from an excessive rate of
-]retransmissions. The rate, in turn, is caused by a mis-tuned adaptive
-]algorithm.
-]
-]TCP packets are retransmitted if no response is received before a
-]timeout period has expired. Our routines implement a variant of the
-]familiar Karn and Jacobson adaptive algorithms, which attempts to
-]predict an efficient timeout value based on the time it took previous
-]packets to complete a roundtrip. Elapsed values are combined into a
-]smoothed average roundtrip time ("RTT") and variance.
-]
-]The key elements in this calculation are the initial RTT value and the
-]subsequent RTT's factored in. The changes we have made involve both of
-]these key areas.
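
The Karn/Jacobson estimator referred to above is well documented; for
readers who have not seen it, here is a minimal C sketch of the usual
smoothed-RTT and mean-deviation update and the timeout derived from
them. This illustrates the published algorithm only, not Sun's kernel
code; the gains (1/8, 1/4) and the 4x-deviation term are the textbook
values.

    /* Illustrative Jacobson-style RTO estimator (not Solaris source). */
    #include <stdio.h>

    struct rtt_state {
        double srtt;    /* smoothed round-trip time (seconds) */
        double rttvar;  /* smoothed mean deviation (seconds)  */
        double rto;     /* current retransmission timeout     */
        int    init;    /* have we seen a valid sample yet?   */
    };

    /* Feed one valid RTT sample.  Karn's rule: never call this with
     * the RTT of a segment that was retransmitted. */
    static void rtt_update(struct rtt_state *s, double measured)
    {
        if (!s->init) {
            s->srtt   = measured;
            s->rttvar = measured / 2.0;
            s->init   = 1;
        } else {
            double err = measured - s->srtt;
            s->srtt   += err / 8.0;                                  /* gain 1/8 */
            s->rttvar += ((err < 0 ? -err : err) - s->rttvar) / 4.0; /* gain 1/4 */
        }
        s->rto = s->srtt + 4.0 * s->rttvar;       /* predicted timeout */
    }

    int main(void)
    {
        struct rtt_state s = { 0 };
        double samples[] = { 1.2, 1.4, 1.1, 2.0, 1.3 };   /* slow-link RTTs */
        for (int i = 0; i < (int)(sizeof samples / sizeof samples[0]); i++) {
            rtt_update(&s, samples[i]);
            printf("sample %.1fs -> srtt %.2fs  rto %.2fs\n",
                   samples[i], s.srtt, s.rto);
        }
        return 0;
    }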
-]
-]
-]INITIAL RTT VALUES
-]
-]As an unintended result of several cumulative changes, the kernel
-]parameter "tcp_rexmit_interval_initial" was actually not being used. In
-]fact, all Internet Routing Entry (IRE) RTT values were being
-]initialized to 512 milliseconds. TCP was using that as an initial
-]setting.
-]
-]For connections which flow through a route with a roundtrip time less
-]than that (such as a LAN or WAN built on 10base-T) all was well. When
-]the connection closed, the actual IRE RTT value was updated and the
-]predictive timeout value successfully adjusted.
-]
-]For connections with an RTT greater than 512 ms, however, the timeout
-]would necessarily trip and retransmissions would occur. If the actual time
-]differed sufficiently from the original estimated value, TCP was never
-]able to send a segment without one or more retransmissions. A realistic
-]RTT for the route could never be established. This scenario is the
-]beginning of the explanation of what has been happening on several-hop
-]Internet or asynchronous PPP links.
-]
-]Our solution is to initialize all IRE RTT's to zero instead of 512 ms.
-]Any new connection for a route will now, when lookup discloses the zero
-]value, get the value of the "tcp_rexmit_interval_initial" parameter
-]instead. (And it's been increased to 3 seconds.) So in most cases the
-]adaptive algorithm will now be able to adjust timeout values
-]effectively.
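
In other words, a zero value in the routing entry now acts as a "no
history" marker. A rough sketch of the selection logic as described
(the tunable name is from the posting; the function name and example
values are made up for illustration):

    /* Illustrative only -- not the Solaris source. */
    #include <stdio.h>

    #define TCP_REXMIT_INTERVAL_INITIAL 3000   /* ms; new default per the posting */

    /* hypothetical helper: choose the starting retransmit interval for
     * a new connection, given the RTT cached in the route's IRE */
    static unsigned int initial_interval_ms(unsigned int cached_ire_rtt_ms)
    {
        if (cached_ire_rtt_ms == 0)                /* route never measured    */
            return TCP_REXMIT_INTERVAL_INITIAL;    /* generous 3-second start */
        return cached_ire_rtt_ms;                  /* reuse the route history */
    }

    int main(void)
    {
        printf("new route:       %u ms\n", initial_interval_ms(0));
        printf("known fast LAN:  %u ms\n", initial_interval_ms(40));
        printf("known PPP route: %u ms\n", initial_interval_ms(900));
        return 0;
    }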
-]
-]
-]RTO (RETRANSMIT TIMEOUT) ALGORITHM INTERACTION
-]
-]Another factor contributing to packet congestion and retransmission was
-]a change to the RTO algorithm, introduced in a 2.4 Kernel Patch. The
-]intent was to make the behavior more "conservative"--that is, lower the
-]risk of poor timeout values. The effect on low-speed links was
-]unexpectedly contrary.
-]
-]A key (and unintended) effect of the code change was that RTT data from
-]retransmitted packets was discarded. This behavior, together with the
-]poor initial RTT values described earlier, meant that the adaptive
-]algorithm was deprived of the information needed to adjust the RTO.
-]
-]Our solution keeps the RTO update conservative, but now refreshes the
-]RTO after no more than one receive window's worth of valid RTT's.
-]Further, when an invalid RTT is seen--an ACK of a retransmitted
-]segment, for example--any valid RTT information already collected is
-]fed into the RTO algorithm rather than discarded.
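
A rough sketch of that bookkeeping might look like the following.
(WINDOW_SEGS and the function names are hypothetical, and the real
code feeds samples into the estimator shown earlier; this only
illustrates the gating.)

    /* Illustrative only -- not the Solaris source. */
    #include <stdio.h>

    #define WINDOW_SEGS 4              /* hypothetical segments per receive window */

    static double pending[WINDOW_SEGS];
    static int    npending;

    static void feed_rto_estimator(double rtt)   /* stand-in for the estimator */
    {
        printf("  estimator fed sample %.2fs\n", rtt);
    }

    static void flush_pending(void)
    {
        for (int i = 0; i < npending; i++)
            feed_rto_estimator(pending[i]);
        npending = 0;
    }

    /* called once per ACKed segment */
    static void ack_seen(double rtt, int was_retransmitted)
    {
        if (was_retransmitted) {
            /* invalid sample (Karn's rule): do not use it, but do not
             * throw away the valid samples already collected */
            flush_pending();
            return;
        }
        pending[npending++] = rtt;
        if (npending == WINDOW_SEGS)     /* at most one window between updates */
            flush_pending();
    }

    int main(void)
    {
        ack_seen(1.1, 0);
        ack_seen(1.3, 0);
        ack_seen(0.0, 1);    /* ACK of a retransmitted segment */
        ack_seen(1.2, 0);
        return 0;
    }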
-]
-]
-]ZERO WINDOW PROBE BUG FIX
-]
-]The problems described so far affect Solaris 2.4 and 2.5 equally. What
-]changed with 2.5?
-]
-]One important fix we included in 2.5 was for the "zero window probe"
-]bug, a well-publicized problem affecting just about all versions of
-]UNIX. As part of that rewrite, we removed a nondescript piece of logic
-]that implemented a simple "backoff" scheme. The excised code caused the
-]RTO to be lengthened by one-eighth as a result of certain failures. It
-]seemed not to be needed; but it had concealed the presence of the other
-]bugs by providing a means for the RTO to reach a successful value. When
-]this code was removed the other underlying problems were exposed.
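
From that description, the removed backoff amounted to roughly the
following (reconstructed for illustration, not the actual excised
code): on certain failures the RTO was stretched by one-eighth of its
current value, so over repeated failures it could creep up to
something workable.

    /* Illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int rto_ms = 800;                /* an RTO that is too short */
        for (int failures = 1; failures <= 4; failures++) {
            rto_ms += rto_ms / 8;                 /* lengthen RTO by 1/8 */
            printf("after failure %d: rto = %u ms\n", failures, rto_ms);
        }
        return 0;
    }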
-]
-]
-]IRE RTT LOGIC
-]
-]This last part of the problem concerns the interaction between TCP and
-]the Solaris-specific Internet Routing Entries. The IRE RTT logic caches
-]RTT values to be re-used when a new connection is made over a familiar
-]link.
-]
-]This is a fine approach. The implementation, however, had a flaw: the
-]IRE RTT was updated regardless of the RTT value supplied by TCP.
-]
-]As you will have guessed by now, users of high-speed links saw no
-]effect. But in highly variable RTT routes, when a connection dominated
-]by small segments was closed, a problem could result. An RTT too short
-]for large segments was used to update the IRE RTT, and a subsequent
-]connection dominated by large segments (like FTP) experienced an
-]excessive retransmission rate. It was a different path to a familiar
-]dilemma: too small a timeout value.
-]
-]Naturally the most highly variable RTT's tend to be seen on async PPP
-]links, where the RTT of the route is compounded from (1) wire latency,
-](2) low bandwidth, and (3) congestion/queuing delays as more than one
-]segment is transmitted by TCP.
-]
-]Our solution is to add a new ndd variable "tcp_rtt_updates". It allows
-]tuning or disabling of IRE RTT updates. A value of zero disables IRE
-]RTT updates. A value greater than zero specifies how many RTT updates
-]to the RTO are required--that is, how many chances the algorithm has
-]had to adapt the timeout--before a closing connection will be allowed
-]to update the RTT in the IRE.
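
A sketch of that gating as described (tcp_rtt_updates is the real
tunable named above; the function and counter names here are
hypothetical):

    /* Illustrative only -- not the Solaris source. */
    #include <stdio.h>

    static unsigned int tcp_rtt_updates = 20;     /* example tunable value */

    /* decide, at connection close, whether this connection's RTT may be
     * cached in the routing entry (IRE) for reuse by later connections */
    static int may_update_ire_rtt(unsigned int rto_updates_seen)
    {
        if (tcp_rtt_updates == 0)
            return 0;                                 /* IRE updates disabled */
        return rto_updates_seen >= tcp_rtt_updates;   /* enough adaptation?   */
    }

    int main(void)
    {
        printf("short, small-segment connection (3 RTO updates): %s\n",
               may_update_ire_rtt(3)  ? "cache RTT" : "skip");
        printf("long-lived bulk connection (50 RTO updates):     %s\n",
               may_update_ire_rtt(50) ? "cache RTT" : "skip");
        return 0;
    }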
-]
-]
-]CONCLUSION
-]
-]We've fished out, fixed, and explained some subtle flaws in our
-]adaptive retransmission algorithm. We take the responsibility for
-]introducing them--and the credit, too, for practically every piece was,
-]by itself, a successful response to our customers' needs. Better and
-]exhaustive testing would have shown up the flaws earlier, privately,
-]harmlessly. That's always our goal, and our customers have a right to
-]expect the best. Yes.
-]
-]There's always tomorrow. In the meantime: we killed this one, folks.
-]Our sincere thanks for your attention--and your business.
-]
-]
-]--
-]=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-]Joseph F. Backo E-Mail: jfb@jfbnet.net
-]=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=