RE: TCP performance on a lossy 1Gbps link

From: Leslie Rhorer
Date: Sat Oct 24 2009 - 19:46:55 EST

> -----Original Message-----
> From: linux-net-owner@xxxxxxxxxxxxxxx [mailto:linux-net-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Matthew Hodgson
> Sent: Monday, October 19, 2009 6:23 PM
> To: linux-net@xxxxxxxxxxxxxxx
> Subject: Re: TCP performance on a lossy 1Gbps link
> Hi - many thanks for the response; comments are inlined:
> Leslie Rhorer wrote:
> >> I've got a dedicated 1000Mbps link between two sites with a rtt of 7ms,
> >> which seems to be dropping about 1 in 20000 packets (MTU of 1500 bytes).
> >> I've got identical boxes at either end of the link running 2.6.27
> >> (e1000e), and I've been trying to saturate the link with TCP
> >> transfers in spite of the packet loss.
> >
> > Why?
> Because I'd like to use existing TCP applications over the link (e.g.
> rsync, mysql, HTTP, ssh, etc.) and get the highest possible throughput.

Well, the operative word here is "applications" - plural. Generally
speaking, one TCP connection is probably not going to be able to saturate a
high bandwidth, high latency link like that, TCP window scaling
notwithstanding. Doing so depends upon both hosts properly supporting
window scaling, needing to transfer large data streams, and being capable of
actually sustaining such a data rate in a real-world transfer.
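
To put rough numbers on that, here is a small Python sketch using the figures
quoted in this thread (1 Gbps, 7 ms rtt). It computes how much data must be in
flight to keep the link full, and the best case for one connection held to a
64 KB window. The function names are made up for this example, and the 64 KB
figure assumes window scaling is not in effect.

```python
def bdp_bytes(link_bps, rtt_s):
    """Bytes that must be in flight to keep the link full."""
    return link_bps * rtt_s / 8


def window_limited_bps(window_bytes, rtt_s):
    """Best-case throughput when at most one window is sent per round trip."""
    return window_bytes * 8 / rtt_s


LINK_BPS = 1_000_000_000   # 1 Gbps link, from the thread
RTT_S = 0.007              # 7 ms round trip, from the thread
WINDOW = 65535             # largest window without scaling (assumption)

print(f"Bandwidth-delay product: {bdp_bytes(LINK_BPS, RTT_S) / 1000:.0f} KB")
print(f"64 KB window ceiling: {window_limited_bps(WINDOW, RTT_S) / 1e6:.1f} Mbps")
```

With these inputs the unscaled window tops out at roughly 75 Mbps, well short
of the link rate, which is why several connections (or a scaled window) are
needed to fill the pipe.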

> Right, I follow your example - but I thought that with SACK turned on
> (as it is by default), the Rx host will immediately send ACKs when
> receiving packets #14 through #20, repeatedly ACKing receipt up to the
> beginning of packet #13 - but with selective ACK blocks to announce
> that it has correctly received subsequent packets. Once the Tx host
> sees three such repeats, it can assume that packet #13 was lost, and
> retransmit it - which surely only takes 1 round trip + 3 more packet

Assuming SACK is supported by both the Tx and Rx host, that is.

> intervals to happen, rather than the 2 seconds of a plain old
> retransmit? Even without SACK, doesn't linux implement Fast Retransmit
> and cause the Tx host to immediately retransmit packet #13 after
> receiving 3 consecutive duplicate ACKs?

I could be mistaken, but I don't think it does. Even so, however,
there would be at least a 7ms delay, which at 1Gbps is a pause of over 800 KB,
or more than 10 standard 64KB windows. Other TCP connections can be using the
bandwidth, though.
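
For reference, the duplicate-ACK trigger described in the quoted text (Fast
Retransmit, as specified in RFC 2581) can be sketched in a few lines of
Python. This is only the counting logic, using packet numbers instead of byte
sequence numbers, and says nothing about what any particular kernel does. The
function name is hypothetical.

```python
DUP_ACK_THRESHOLD = 3  # duplicate ACKs needed to trigger, per RFC 2581


def fast_retransmit_points(acks):
    """Return the ACK numbers at which a fast retransmit would fire.

    `acks` is the sequence of cumulative ACK numbers seen by the sender,
    where ACK n means "next packet expected is n".
    """
    retransmits = []
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == DUP_ACK_THRESHOLD:
                retransmits.append(ack)  # resend the packet the receiver wants
        else:
            last_ack = ack
            dup_count = 0
    return retransmits


# Packets 1..12 arrive, 13 is lost, and 14..20 each provoke a duplicate ACK.
acks = list(range(2, 14)) + [13] * 7
print(fast_retransmit_points(acks))
```

In this trace the third duplicate ACK for packet 13 is the only trigger; the
later duplicates do not fire again.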

> This entire process seems like it should be able to happen without
> causing enormous disruption, and whilst the window might be briefly

How do you define "enormous"? A packet loss rate well below 1%
should not normally be obvious to users.

> > If a re-transmit is required, then TCP does adjust the window size
> > to accommodate what it presumes is congestion on the link. It also never
> > starts out streaming at full bandwidth. It continually adjusts its window
> > size upwards until it encounters what it interprets as congestion issues, or
> > the maximum window size supported by the two hosts.
> Right. I understand this as the congestion avoidance and slow start
> algorithms from RFC2581.

Correct. Note it is the window size which is being adjusted to vary
the pacing.
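
A toy model of that window adjustment, in Python, may make the shape clearer.
It is a simplified sketch of slow start and congestion avoidance in the
spirit of RFC 2581, counting in whole segments per round trip, and not a
model of any real stack. The function name, the starting threshold and the
round in which the loss occurs are all assumptions chosen for illustration.

```python
def simulate_cwnd(rounds, ssthresh, max_window, loss_rounds=()):
    """Return the congestion window (in segments) at the start of each round trip."""
    cwnd = 1
    history = []
    for rnd in range(rounds):
        history.append(cwnd)
        if rnd in loss_rounds:
            # Loss seen this round: halve the threshold and resume from it.
            ssthresh = max(cwnd // 2, 2)
            cwnd = ssthresh
        elif cwnd < ssthresh:
            # Slow start: double each round trip, up to the threshold.
            cwnd = min(cwnd * 2, ssthresh)
        else:
            # Congestion avoidance: grow by one segment per round trip.
            cwnd += 1
        # Never exceed the maximum window the two hosts support.
        cwnd = min(cwnd, max_window)
    return history


print(simulate_cwnd(10, ssthresh=16, max_window=44, loss_rounds={6}))
```

The window doubles up to the threshold, creeps upward one segment at a time,
is cut in half at the loss, and then resumes the slow linear climb.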

> >> What else should I be doing to crank up the throughput and defeat the
> >> congestion control?
> >
> > Why would you be trying to do this?
> To get the most throughput out of the link for TCP transfers between
> existing applications.

Doing what you are suggesting is probably going to result in poorer
throughput, not better. It's true TCP's congestion control does not work
terribly well on noisy links, but defeating it won't improve throughput on
noisy links. To do that, one must implement some different form of transmit
pacing, not just defeat the old one.

> > It is true TCP works well with
> > congested links, but not so well with links suffering random errors. You
> > aren't going to be successful in breaking the TCP handshaking parameters
> > without breaking TCP itself.
> Right. I'm not trying to break the handshaking parameters - just adjust
> the extent to which the congestion window is reduced in the face of
> packet loss, admittedly at the risk of increasing packet loss when the
> link is genuinely saturated.

It's worse than that. With a fixed window size, the amount of data
having to be streamed to make up for a lost packet increases. SACK can help
tremendously, but one is still going to encounter a great deal of additional
overhead, along with the bandwidth eaten up by the re-transmits, of course.

> Surely implementing reliable data transfer at the application level ends
> up being effectively the same as re-implementing TCP (although I guess
> you could miss out the congestion control, or find some
> application-layer mechanism for reserving bandwidth for the stream).

In the larger sense, yes. In detail, however, one can implement a
different strategy for error control. As one person mentioned, Forward
Error Correction is one possibility. FEC intentionally sacrifices some
throughput in order to make up for bit errors.
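
As a minimal illustration of that trade-off, here is a Python sketch of about
the simplest FEC scheme there is: one XOR parity packet per group of
equal-length data packets. It is an assumption picked for brevity, not the
scheme anyone in this thread proposed, and it can repair only one lost packet
per group. The helper names are hypothetical.

```python
from functools import reduce


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(packets):
    """Parity packet for a group of equal-length data packets."""
    return reduce(xor_bytes, packets)


def recover(received, parity):
    """Rebuild the single missing packet (marked None) in a group."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can repair exactly one loss per group")
    present = [p for p in received if p is not None]
    repaired = list(received)
    # XORing the parity with every surviving packet leaves the lost one.
    repaired[missing[0]] = reduce(xor_bytes, present, parity)
    return repaired


group = [b"aaaa", b"bbbb", b"cccc", b"dddd"]
parity = make_parity(group)
damaged = [group[0], None, group[2], group[3]]
print(recover(damaged, parity) == group)
```

Sending one parity packet for every four data packets costs a fifth of the
link capacity, which is the throughput being given up in exchange for not
having to wait for a re-transmit.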

> Yes, but I really don't think that this is what is slowing my throughput
> down in this instance - instead, the congestion window is clamping the
> data rate at the sender. Looking at a tcptrace time sequence graph, I
> can see that only a small fraction of the available TCP window is ever
> used - and I can only conclude that the Tx host is just holding off on
> sending due to adhering to the artificially reduced window.

I could buy that if your packet loss rate were higher. With a 64K
window, one packet randomly lost in every 20,000 means roughly one
re-transmit every 458 window spans. That should not be enough to greatly
reduce the congestion window, and it should recover fairly well between
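
The arithmetic behind the 458 figure, as a short Python check. It assumes
every packet is a full 1500 bytes and that exactly one 64 KB window is sent
per round trip; the function name is made up for this example.

```python
def windows_between_losses(window_bytes, packet_bytes, loss_interval_packets):
    """Average number of full windows sent between two lost packets."""
    packets_per_window = window_bytes / packet_bytes
    return loss_interval_packets / packets_per_window


WINDOW_BYTES = 64 * 1024   # 64K window
PACKET_BYTES = 1500        # MTU quoted in the thread
LOSS_INTERVAL = 20_000     # one packet lost in every 20,000
RTT_S = 0.007              # 7 ms round trip

spans = windows_between_losses(WINDOW_BYTES, PACKET_BYTES, LOSS_INTERVAL)
print(f"Windows between losses: {spans:.0f}")
print(f"Time between losses at one window per rtt: {spans * RTT_S:.1f} s")
```

Under those assumptions a single connection would see a loss about once every
three seconds.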

> Presumably a rather perverse solution to this would be a proxy to split
> a single TCP stream into multiple streams, and then reassemble them at
> the other end - thus pushing the problem into one of having large Rx
> application-layer buffers for reassembly, using socket back-pressure to
> keep the component TCP streams sufficiently in step.

I imagine there are security protocols which do this. I'm not
familiar with any specifically, but the strategy - using geographically
diverse links for each sub-stream - is a good one to prevent eavesdropping.
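
The splitting and reassembly itself is simple enough to sketch in Python.
Here plain lists stand in for the component TCP streams, chunks are dealt out
round-robin with a sequence number, and the receiver puts them back in order.
The names and the chunk size are hypothetical, and the buffering and
back-pressure the quoted text mentions are left out entirely.

```python
def stripe(data, n_streams, chunk_size):
    """Deal numbered chunks of `data` round-robin across n_streams lists."""
    streams = [[] for _ in range(n_streams)]
    for seq, offset in enumerate(range(0, len(data), chunk_size)):
        chunk = data[offset:offset + chunk_size]
        streams[seq % n_streams].append((seq, chunk))
    return streams


def reassemble(streams):
    """Merge the chunks from all streams back into the original byte order."""
    chunks = sorted(chunk for stream in streams for chunk in stream)
    return b"".join(payload for _, payload in chunks)


data = bytes(range(256)) * 10
streams = stripe(data, n_streams=4, chunk_size=100)
print(reassemble(streams) == data)
```

A real proxy would have to reassemble incrementally as chunks arrive, which
is where the large receive buffers come in.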

> Does anyone know
> if anyone's written such a thing? Or do people simply write off TCP
> single stream throughput through WANs like this?

Most IT people are happy with packet losses well below 1%, if that
is what you mean.
