Yet more patches to the tcp code.
The appended patch to linux 1.99.4 fixes a few more TCP speed problems.
There are still some pathalogical cases in which the current linux code
will behave worse than the 1.2.13 code. More about this in a seperate post.
The problems that this patch fixes:
(1) RFC1122 requires that an ACK be sent every 2 full sized frames.
We were only sending every 3 full sized frames, there was > that
should have been a >=. This was slowing things down a fair bit.
(2) Jacobson's revised SIGCOMM'88 paper states that changing the RTO
calculation from
RTO = R + 2V
to
RTO = R + 4V
improved performance. I've changed the code to match this.
This should improve startup on slow links a bit. It doesn't
do much for really slow links though.
(3) The implementation of Jacobson's slow start algorithm was
disabled by incorrect initialization of ssthresh. This has
been fixed. This should result in a much faster (exponential vs linear)
run up to the actual congestion window.
(4) The code for tcp_send_delayed_ack() got changed as some point
so that it was passed a maximum delay, rather than the actual delay.
In a couple of places it was still being used as though the parameter
was the actual delay. This was resulting in us sending out two acks
for every duplicate packet, which just encoraged the remote end to
send more packets than we had room for. See the comment I've added
in tcp_queue for more explanation of what's going on here.
(5) I've made some changes to the way delayed ACKs are treated.
In particular if the packet interarrival time is larger than
1/2 second, then delaying ACKs is a bad idea, since it will just
result in skewing the RTT calculation for the sender.
So, I changed things so that if sk->ato > HZ/2, we simply don't
delay the ACK at all.
Also, I made some changes to the tcp_delack_estimator so that
the ato calculation is slightly better. It could still stand
improvement. Something like what is suggested in RFC813 would be good.
Pedro has an implementation in the new IPV6 code, and I've got
an alternative implementation around from some other testing I did,
but I think we want to run some experiments before we decide on
the "right" one and toss it into the kernel. The current code
should be good enough for now.
(6) S. Floyd's paper "TCP and Successive Fast Retransmits" describes a
problem that arises in TCP implementations that do fast retransmits
in the Tahoe and Reno styles. The paper suggests a small fix
that helps avoid the problem. I've implemented this fix.
It adds the variable high_seq to the sk structure, and sets this
to the current maximum sent sequence number every time we do a
retransmit timeout or get a source quench. When checking for fast
retransmits ACKs for packets below high_seq are disallowed.
With these fixes in place I find that I can again put data across my
14.4k modem link at close to 1.6K/s, and I can generally get data
from a SunOS host at about 1.5 K/s, and up to 1.6K/s from a nearby SunOS host.
-- eric
P.S. I'll be away for a few days so I won't be able to answer any questions
about this stuff until next Tuesday or Wednesday.
---------------------------------------------------------------------------
Eric Schenk www: http://www.cs.toronto.edu/~schenk
Department of Computer Science email: schenk@cs.toronto.edu
University of Toronto
--------------------- CUT HERE --------------------------------------------
diff -u -r linux-1.99.4/include/net/sock.h linux/include/net/sock.h
--- linux-1.99.4/include/net/sock.h Thu May 16 04:58:17 1996
+++ linux/include/net/sock.h Thu May 16 19:32:26 1996
@@ -211,6 +211,7 @@
unsigned short max_unacked;
unsigned short window;
__u32 lastwin_seq; /* sequence number when we last updated the window we offer */
+ __u32 high_seq; /* sequence number when we did current fast retransmit */
volatile unsigned long ato; /* ack timeout */
volatile unsigned long lrcvtime; /* jiffies at last rcv */
unsigned short bytes_rcv;
diff -u -r linux-1.99.4/include/net/tcp.h linux/include/net/tcp.h
--- linux-1.99.4/include/net/tcp.h Thu May 16 04:58:52 1996
+++ linux/include/net/tcp.h Thu May 16 20:11:57 1996
@@ -155,7 +155,7 @@
extern void tcp_send_synack(struct sock *, struct sock *, struct sk_buff *);
extern void tcp_send_skb(struct sock *, struct sk_buff *);
extern void tcp_send_ack(struct sock *sk);
-extern void tcp_send_delayed_ack(struct sock *sk, int timeout);
+extern void tcp_send_delayed_ack(struct sock *sk, int max_timeout, unsigned long timeout);
extern void tcp_send_reset(unsigned long saddr, unsigned long daddr, struct tcphdr *th,
struct proto *prot, struct options *opt, struct device *dev, int tos, int ttl);
diff -u -r linux-1.99.4/net/ipv4/tcp.c linux/net/ipv4/tcp.c
--- linux-1.99.4/net/ipv4/tcp.c Thu May 16 04:59:14 1996
+++ linux/net/ipv4/tcp.c Fri May 17 02:08:05 1996
@@ -194,12 +194,6 @@
* against machines running Solaris,
* and seems to result in general
* improvement.
- * Eric Schenk : Changed receiver side silly window
- * avoidance algorithm to BSD style
- * algorithm. This doubles throughput
- * against machines running Solaris,
- * and seems to result in general
- * improvement.
*
* To Fix:
* Fast path the code. Two things here - fix the window calculation
@@ -519,11 +513,13 @@
{
/*
* FIXME:
- * For now we will just trigger a linear backoff.
- * The slow start code should cause a real backoff here.
+ * Follow BSD for now and just reduce cong_window to 1 again.
+ * It is possible that we just want to reduce the
+ * window by 1/2, or that we want to reduce ssthresh by 1/2
+ * here as well.
*/
- if (sk->cong_window > 4)
- sk->cong_window--;
+ sk->cong_window = 1;
+ sk->high_seq = sk->sent_seq;
return;
}
diff -u -r linux-1.99.4/net/ipv4/tcp_input.c linux/net/ipv4/tcp_input.c
--- linux-1.99.4/net/ipv4/tcp_input.c Thu May 16 04:59:15 1996
+++ linux/net/ipv4/tcp_input.c Fri May 17 02:10:32 1996
@@ -21,6 +21,10 @@
*
* FIXES
* Pedro Roque : Double ACK bug
+ * Eric Schenk : Fixes to slow start algorithm.
+ * Eric Schenk : Yet another double ACK bug.
+ * Eric Schenk : Delayed ACK bug fixes.
+ * Eric Schenk : Floyd style fast retrans war avoidance.
*/
#include <linux/config.h>
@@ -57,7 +61,15 @@
if (m <= 0)
m = 1;
- if (m > (sk->rtt >> 3))
+ /* Yikes. This used to test if m was larger than rtt/8.
+ * Maybe on a long delay high speed link this would be
+ * good initial guess, but over a slow link where the
+ * delay is dominated by transmission time this will
+ * be very bad, since ato will almost always be something
+ * more like rtt/2. Better to discard data points that
+ * are larger than the rtt estimate.
+ */
+ if (m > sk->rtt)
{
sk->ato = sk->rtt >> 3;
/*
@@ -66,6 +78,11 @@
}
else
{
+ /*
+ * Very fast acting estimator.
+ * May fluctuate too much. Probably we should be
+ * doing something like the rtt estimator here.
+ */
sk->ato = (sk->ato >> 1) + m;
/*
* printk(KERN_DEBUG "ato: m %lu\n", sk->ato);
@@ -104,14 +121,14 @@
} else {
/* no previous measure. */
sk->rtt = m<<3; /* take the measured time to be rtt */
- sk->mdev = m<<2; /* make sure rto = 3*rtt */
+ sk->mdev = m<<1; /* make sure rto = 3*rtt */
}
/*
* Now update timeout. Note that this removes any backoff.
*/
- sk->rto = ((sk->rtt >> 2) + sk->mdev) >> 1;
+ sk->rto = (sk->rtt >> 3) + sk->mdev;
if (sk->rto > 120*HZ)
sk->rto = 120*HZ;
if (sk->rto < HZ/5) /* Was 1*HZ - keep .2 as minimum cos of the BSD delayed acks */
@@ -180,11 +197,9 @@
}
/*
- * 4.3reno machines look for these kind of acks so they can do fast
- * recovery. Three identical 'old' acks lets it know that one frame has
- * been lost and should be resent. Because this is before the whole window
- * of data has timed out it can take one lost frame per window without
- * stalling. [See Jacobson RFC1323, Stevens TCP/IP illus vol2]
+ * This packet is old news. Usually this is just a resend
+ * from the far end, but sometimes it means the far end lost
+ * an ACK we send, so we better send an ACK.
*/
tcp_send_ack(sk);
}
@@ -398,13 +413,19 @@
newsk->send_head = NULL;
newsk->send_tail = NULL;
skb_queue_head_init(&newsk->back_log);
- newsk->rtt = 0; /*TCP_CONNECT_TIME<<3*/
+ newsk->rtt = 0;
newsk->rto = TCP_TIMEOUT_INIT;
- newsk->mdev = TCP_TIMEOUT_INIT<<1;
+ newsk->mdev = TCP_TIMEOUT_INIT;
newsk->max_window = 0;
+ /*
+ * See draft-stevens-tcpca-spec-01 for discussion of the
+ * initialization of these values.
+ */
newsk->cong_window = 1;
newsk->cong_count = 0;
- newsk->ssthresh = 0;
+ newsk->ssthresh = 0x7fffffff;
+
+ newsk->high_seq = 0;
newsk->backoff = 0;
newsk->blog = 0;
newsk->intr = 0;
@@ -684,7 +705,7 @@
* interpreting "new data is acked" as including data that has
* been retransmitted but is just now being acked.
*/
- if (sk->cong_window < sk->ssthresh)
+ if (sk->cong_window <= sk->ssthresh)
/*
* In "safe" area, increase
*/
@@ -720,6 +741,8 @@
* (2) it has the same window as the last ACK,
* (3) we have outstanding data that has not been ACKed
* (4) The packet was not carrying any data.
+ * (5) [From Floyds paper on fast retransmit wars]
+ * The packet acked data after high_seq;
* I've tried to order these in occurrence of most likely to fail
* to least likely to fail.
* [These are the rules BSD stacks use to determine if an ACK is a
@@ -729,7 +752,8 @@
if (sk->rcv_ack_seq == ack
&& sk->window_seq == window_seq
&& !(flag&1)
- && before(ack, sk->sent_seq))
+ && before(ack, sk->sent_seq)
+ && after(ack, sk->high_seq))
{
/* See draft-stevens-tcpca-spec-01 for explanation
* of what we are doing here.
@@ -738,12 +762,16 @@
if (sk->rcv_ack_cnt == MAX_DUP_ACKS+1) {
sk->ssthresh = max(sk->cong_window >> 1, 2);
sk->cong_window = sk->ssthresh+MAX_DUP_ACKS+1;
- tcp_do_retransmit(sk,0);
- /* reduce the count. We don't want to be
- * seen to be in "retransmit" mode if we
- * are doing a fast retransmit.
- */
+ /* FIXME:
+ * reduce the count. We don't want to be
+ * seen to be in "retransmit" mode if we
+ * are doing a fast retransmit.
+ * This is also a signal to tcp_do_retransmit
+ * not to set sk->high_seq.
+ * This is a horrible ugly hack.
+ */
sk->retransmits--;
+ tcp_do_retransmit(sk,0);
} else if (sk->rcv_ack_cnt > MAX_DUP_ACKS+1) {
sk->cong_window++;
/*
@@ -795,7 +823,18 @@
* Recompute rto from rtt. this eliminates any backoff.
*/
- sk->rto = ((sk->rtt >> 2) + sk->mdev) >> 1;
+ /*
+ * Appendix C of Van Jacobson's final version of
+ * the SIGCOMM 88 paper states that although
+ * the original paper suggested that
+ * RTO = R*2V
+ * was the correct calculation experience showed
+ * better results using
+ * RTO = R*4V
+ * In particular this gives better performance over
+ * slow links, and should not effect fast links.
+ */
+ sk->rto = (sk->rtt >> 3) + sk->mdev;
if (sk->rto > 120*HZ)
sk->rto = 120*HZ;
if (sk->rto < HZ/5) /* Was 1*HZ, then 1 - turns out we must allow about
@@ -827,7 +866,7 @@
break;
if (sk->retransmits)
- {
+ {
/*
* We were retransmitting. don't count this in RTT est
*/
@@ -1322,7 +1361,7 @@
int delay = HZ/2;
if (th->psh)
delay = HZ/50;
- tcp_send_delayed_ack(sk, delay);
+ tcp_send_delayed_ack(sk, delay, sk->ato);
}
/*
@@ -1357,7 +1396,15 @@
if(sk->debug)
printk("Ack past end of seq packet.\n");
tcp_send_ack(sk);
- tcp_send_delayed_ack(sk,HZ/2);
+ /*
+ * We need to be very careful here. We must
+ * not violate Jacobsons packet conservation condition.
+ * This means we should only send an ACK when a packet
+ * leaves the network. We can say a packet left the
+ * network when we see a packet leave the network, or
+ * when an rto measure expires.
+ */
+ tcp_send_delayed_ack(sk,sk->rto,sk->rto);
}
}
}
@@ -1397,7 +1444,8 @@
kfree_skb(skb, FREE_READ);
return(0);
}
-
+
+
/*
* We no longer have anyone receiving data on this connection.
*/
@@ -1455,6 +1503,11 @@
#endif
+ /*
+ * We should only call this if there is data in the frame.
+ */
+ tcp_delack_estimator(sk);
+
tcp_queue(skb, sk, th);
return(0);
@@ -1900,8 +1953,6 @@
return tcp_reset(sk,skb);
}
- tcp_delack_estimator(sk);
-
/*
* Process the ACK
*/
diff -u -r linux-1.99.4/net/ipv4/tcp_output.c linux/net/ipv4/tcp_output.c
--- linux-1.99.4/net/ipv4/tcp_output.c Thu May 16 04:59:16 1996
+++ linux/net/ipv4/tcp_output.c Fri May 17 01:10:15 1996
@@ -188,7 +188,7 @@
tcp_send_check(th, sk->saddr, sk->daddr, size, skb);
sk->sent_seq = sk->write_seq;
-
+
/*
* This is mad. The tcp retransmit queue is put together
* by the ip layer. This causes half the problems with
@@ -527,6 +527,7 @@
}
}
+
/*
* Count retransmissions
*/
@@ -535,6 +536,13 @@
sk->retransmits++;
sk->prot->retransmits++;
tcp_statistics.TcpRetransSegs++;
+
+ /*
+ * Record the high sequence number to help avoid doing
+ * to much fast retransmission.
+ */
+ if (sk->retransmits)
+ sk->high_seq = sk->sent_seq;
/*
@@ -821,20 +829,27 @@
* - delay time <= 0.5 HZ
* - must send at least every 2 full sized packets
* - we don't have a window update to send
+ *
+ * additional thoughts:
+ * - we should not delay sending an ACK if we have ato > 0.5 HZ.
+ * My thinking about this is that in this case we will just be
+ * systematically skewing the RTT calculation. (The rule about
+ * sending every two full sized packets will never need to be
+ * invoked, the delayed ack will be sent before the ATO timeout
+ * every time. Of course, the relies on our having a good estimate
+ * for packet interarrival times.
*/
-void tcp_send_delayed_ack(struct sock * sk, int max_timeout)
+void tcp_send_delayed_ack(struct sock * sk, int max_timeout, unsigned long timeout)
{
- unsigned long timeout, now;
+ unsigned long now;
/* Calculate new timeout */
now = jiffies;
- timeout = sk->ato;
- if (timeout > max_timeout)
- timeout = max_timeout;
- timeout += now;
- if (sk->bytes_rcv > sk->max_unacked) {
+ if (timeout > max_timeout || sk->bytes_rcv >= sk->max_unacked) {
timeout = now;
mark_bh(TIMER_BH);
+ } else {
+ timeout += now;
}
/* Use new timeout only if there wasn't a older one earlier */
@@ -894,7 +909,7 @@
* resend packets.
*/
- tcp_send_delayed_ack(sk, HZ/2);
+ tcp_send_delayed_ack(sk, HZ/2, HZ/2);
return;
}
--------------------- CUT HERE --------------------------------------------