Fixes for TCP retransmission bugs

Eric Schenk (schenk@rnode84.cs.toronto.edu)
Thu, 30 May 1996 03:01:47 -0400


Hi All,

Back from conferences and a short vacation, and I've got another round
of bug fixes for the TCP stack. This patch covers a raft of newly
discovered bugs, among other things. In general I've tried to clean
some of the code up a bit while fixing these bugs.
I've got at least two outstanding reports of extremely slow TCP
performance that do not seem to be solved by the attached patches.
I'm still looking into these cases.

As an aside, I think I *may* also have fixed the bug with sockets hanging
around in the CLOSED state. It turns out there was a piece of code
that called delete_timer() when it should have been calling del_timer().
This deleted the socket timer instead of removing the retransmit timer.
I've fixed this. I'm not sure it will fix the problem though since I
was never able to reproduce it. Anyone who was seeing this problem,
please let me know if it is still present.

Anyway, on to the new bugs. After a recent (disastrous) attempt to
gain more speed on my Internet connection by installing a cheap 28.8
Sportster clone, I discovered yet another set of bugs in the TCP code.
One was revealed by the congestion window fixes that appeared in 1.99.6.
The others are related. All have to do with the handling of retransmission
of packets on timeout. The presence of these bugs can reduce
outgoing TCP to a crawl, or even cause sessions to disconnect
during short periods of heavy congestion.

More specifically the attached patches fix the following problems.

First, we are counting the number of consecutive retransmit
timeouts incorrectly. Instead of incrementing the counter once for
each actual timeout (that is, each time we enter retransmit mode),
we are counting 1 for each packet we resend. If the retransmission
buffer is large (as can now happen since the congestion window growth
code was fixed), then we can end up forcing the connection to reset
before recovering from the fault, even though only one round trip
timeout has occurred.
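
To make the counting error concrete, here is a rough standalone sketch.
It is not the kernel code: `toy_sock`, `old_timeout`, `new_timeout`,
`would_reset` and the threshold `TOY_MAX_RETRIES` are hypothetical names
chosen for illustration, and the old behaviour is simplified to "one
increment per packet resent while recovering from a single timeout".

```c
/* Toy model only: these names are illustrative, not kernel identifiers. */
#define TOY_MAX_RETRIES 7   /* assumed give-up threshold, for illustration */

struct toy_sock {
    int retransmits;   /* consecutive retransmit timeouts counted so far */
    int queue_len;     /* packets waiting on the retransmission queue */
};

/* Old behaviour (simplified): the counter goes up once per packet resent. */
static void old_timeout(struct toy_sock *sk)
{
    int i;

    for (i = 0; i < sk->queue_len; i++)
        sk->retransmits++;
}

/* Fixed behaviour: the counter goes up once per timeout event. */
static void new_timeout(struct toy_sock *sk)
{
    sk->retransmits++;
}

/* Nonzero if the counter has passed the (assumed) give-up threshold. */
static int would_reset(const struct toy_sock *sk)
{
    return sk->retransmits > TOY_MAX_RETRIES;
}
```

With a queue of 10 packets, a single timeout pushes the old counter
past the assumed threshold, while the fixed counter reads 1.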

Second, we are overdoing the retransmission considerably.
When a retransmission timeout occurs, we shrink the congestion
window to 1, and then send out as many packets from the retransmission
queue as allowed by the current congestion window.
From this point on, each time an ACK arrives we grow the
congestion window normally, and then proceed to resend as many
packets from the retransmission queue as allowed. The problem
here is that we always start resending from the start of the queue.
Consider a queue with packets numbered 1 through 10.
Initially we send out packet 1, then receive an ack for it,
send out packets 2 and 3, receive an ack, send out packets 3, 4 and 5,
receive an ack, send out packets 4, 5, 6 and 7, etc. As can be
seen, this results in packets toward the end of the queue being
retransmitted a number of times proportional to the length of the queue.

This violates the conditions Jacobson needs to prove that exponential
backoff is sufficient to prevent network collapse in the event of congestion.
In particular it is necessary that within a timeout period each packet
in the retransmission queue should be sent at most once.

The fact that Linux violates these conditions is unlikely to cause much
trouble in the case that there are only a few Linux boxes on a network,
but if a network is populated only with Linux boxes, and the load
gets high enough, the Linux stack could end up thrashing the network
to death, as currently the traffic it generates only backs off linearly.

I added a new variable (send_next) to the socket structure and
I've changed the retransmission code so that it advances this
pointer each time it sends a packet. Whenever we get a timeout
the send_next pointer is returned to the head of the queue.
This limits the number of times each packet is sent to one per timeout.
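
The effect of the saved pointer can be seen in a small standalone
simulation. This is a sketch under simplifying assumptions, not the
patched kernel code: each ACK is assumed to acknowledge exactly one
packet and to grow the window by one, the `packets_out` limit is
ignored, and `simulate`, `TOY_QLEN` and the array `sent` are
hypothetical names.

```c
#define TOY_QLEN 10  /* queue length taken from the example in the text */

/*
 * Resend up to cwnd packets per ACK and record how often each is sent.
 * use_send_next == 0: always restart from the queue head (old behaviour).
 * use_send_next != 0: continue from a saved index (new behaviour).
 * Returns the largest number of times any single packet was sent.
 */
static int simulate(int use_send_next, int sent[TOY_QLEN])
{
    int head = 0;       /* first unacknowledged packet */
    int send_next = 0;  /* next packet to resend (new behaviour only) */
    int cwnd = 1;       /* congestion window right after the timeout */
    int max = 0;
    int i, n;

    for (i = 0; i < TOY_QLEN; i++)
        sent[i] = 0;

    while (head < TOY_QLEN) {
        i = use_send_next ? send_next : head;
        if (i < head)
            i = head;           /* never resend acknowledged data */
        for (n = 0; n < cwnd && i < TOY_QLEN; n++, i++) {
            sent[i]++;
            if (sent[i] > max)
                max = sent[i];
        }
        send_next = i;          /* remember where we stopped */
        head++;                 /* one ACK arrives for the head packet */
        cwnd++;                 /* window grows by one per ACK */
    }
    return max;
}
```

In this toy model the old scheme sends the packets at the tail of a
10-packet queue five times each, while the send_next scheme sends
every packet exactly once per timeout.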

Third, we are setting the retransmission timeout relative to the last packet
in the queue. This can result in slower than necessary recovery from a fault.
Oddly enough, this same problem can also lead to a situation in which we have
a timeout that is too early. In particular if we are transmitting a small
amount of data (less than the size of the window offered by the remote
receiver), and there are no congestion problems on the network, then
the timeout set for the last packet at the time it is sent will
necessarily (!) expire before the ack can arrive. [If anyone _really_
cares I could type in the proof, but since I don't see a paper coming out
of it, I'm unlikely to bother without some kind of provocation...]
To correct both these problems I've changed things so that we attach
a timeout to the first packet in the retransmission queue, and each time we
receive an ack that changes the head of the queue we set the
timeout for that packet based on the newly calculated timeout.
This takes care of both problems with the old settings for timeouts
and has the happy side effect of cleaning up the code in tcp_ack()
a fair bit.
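
As a rough illustration of the new timer rule, the sketch below
computes the expiry time from the send time of the packet at the head
of the retransmission queue, and falls back to "soon" if that moment
has already passed. The function and parameter names are hypothetical,
the current time is passed in as `now` rather than read from the
kernel's jiffies counter, and counter wraparound is ignored.

```c
/*
 * Toy model of the new rule: time out relative to when the head of the
 * retransmission queue was sent. Names here are illustrative only.
 */
static unsigned long write_timer_expiry(unsigned long head_when,
                                        unsigned long rto,
                                        unsigned long now)
{
    unsigned long expires = head_when + rto;

    /* If the deadline is already in the past, fire shortly instead. */
    if (expires < now)
        expires = now + 2;
    return expires;
}

/* Old rule, for comparison: time out relative to the most recent send. */
static unsigned long old_expiry(unsigned long last_sent, unsigned long rto)
{
    return last_sent + rto;
}
```

For a head packet sent at tick 100 with an rto of 50, the new rule
gives tick 150 regardless of how many later packets were sent, whereas
the old rule slides the deadline forward with every transmission.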

Finally, there is one more fix included here that prevents
the use of fast retransmit if the congestion window is not at
least 3. This should return the performance over really bad links
to the pre "fast retransmit" levels.
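
The gating rule can be sketched in isolation as follows. This is a
simplified model of the duplicate ACK branch, not the actual
tcp_input.c code: `toy_tcp` and `on_dup_ack` are hypothetical names,
and `TOY_MAX_DUP_ACKS` is set to 2 as an assumption for illustration.

```c
#define TOY_MAX_DUP_ACKS 2  /* assumed value, for illustration only */

struct toy_tcp {
    int cong_window;  /* congestion window, in packets */
    int ssthresh;     /* slow start threshold */
    int dup_acks;     /* duplicate ACKs counted so far */
};

/* Process one duplicate ACK; returns 1 if a fast retransmit is triggered. */
static int on_dup_ack(struct toy_tcp *tp)
{
    /* Only start counting if the window is at least 3, but keep
     * counting if a fast retransmit is already under way. */
    if (tp->cong_window >= 3 || tp->dup_acks > TOY_MAX_DUP_ACKS + 1)
        tp->dup_acks++;

    if (tp->dup_acks == TOY_MAX_DUP_ACKS + 1) {
        int half = tp->cong_window >> 1;

        tp->ssthresh = half > 2 ? half : 2;
        tp->cong_window = tp->ssthresh + TOY_MAX_DUP_ACKS + 1;
        return 1;   /* caller would resend the head of the queue here */
    }
    if (tp->dup_acks > TOY_MAX_DUP_ACKS + 1)
        tp->cong_window++;   /* inflate the window for each extra dup ACK */
    return 0;
}
```

With a window of 2 the counter never moves, so no fast retransmit
fires and recovery is left to the ordinary timeout, as before fast
retransmit was introduced.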

So, without further ado, here are the promised patches. They
are against linux-1.99.8.

-- eric

---------------------------------------------------------------------------
Eric Schenk www: http://www.cs.toronto.edu/~schenk
Department of Computer Science email: schenk@cs.toronto.edu
University of Toronto

diff -r -u linux-1.99.8/include/net/sock.h linux/include/net/sock.h
--- linux-1.99.8/include/net/sock.h Thu May 23 10:09:38 1996
+++ linux/include/net/sock.h Wed May 29 02:07:22 1996
@@ -196,6 +196,7 @@
struct sock *prev; /* Doubly linked chain.. */
struct sock *pair;
struct sk_buff * volatile send_head;
+ struct sk_buff * volatile send_next;
struct sk_buff * volatile send_tail;
struct sk_buff_head back_log;
struct sk_buff *partial;
diff -r -u linux-1.99.8/net/ipv4/af_inet.c linux/net/ipv4/af_inet.c
--- linux-1.99.8/net/ipv4/af_inet.c Thu May 23 10:09:41 1996
+++ linux/net/ipv4/af_inet.c Wed May 29 01:24:24 1996
@@ -382,6 +382,8 @@
skb = skb2;
}
sk->send_head = NULL;
+ sk->send_tail = NULL;
+ sk->send_next = NULL;
sti();

/*
diff -r -u linux-1.99.8/net/ipv4/ip_output.c linux/net/ipv4/ip_output.c
--- linux-1.99.8/net/ipv4/ip_output.c Thu May 23 10:09:44 1996
+++ linux/net/ipv4/ip_output.c Wed May 29 01:24:24 1996
@@ -450,6 +450,7 @@
{
sk->send_tail = skb;
sk->send_head = skb;
+ sk->send_next = skb;
}
else
{
diff -r -u linux-1.99.8/net/ipv4/tcp.c linux/net/ipv4/tcp.c
--- linux-1.99.8/net/ipv4/tcp.c Thu May 23 10:09:44 1996
+++ linux/net/ipv4/tcp.c Wed May 29 13:44:00 1996
@@ -2003,10 +2003,7 @@
sk->delack_timer.data = (unsigned long) sk;
sk->retransmit_timer.function = tcp_retransmit_timer;
sk->retransmit_timer.data = (unsigned long)sk;
- tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto); /* Timer for repeating the SYN until an answer */
- sk->retransmits = 0; /* Now works the right way instead of a hacked
- initial setting */
-
+ sk->retransmits = 0;
sk->prot->queue_xmit(sk, dev, buff, 0);
tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
tcp_statistics.TcpActiveOpens++;
diff -r -u linux-1.99.8/net/ipv4/tcp_input.c linux/net/ipv4/tcp_input.c
--- linux-1.99.8/net/ipv4/tcp_input.c Thu May 23 10:09:45 1996
+++ linux/net/ipv4/tcp_input.c Thu May 30 01:10:29 1996
@@ -25,6 +25,9 @@
* Eric Schenk : Yet another double ACK bug.
* Eric Schenk : Delayed ACK bug fixes.
* Eric Schenk : Floyd style fast retrans war avoidance.
+ * Eric Schenk : Skip fast retransmit on small windows.
+ * Eric Schenk : Fixes to retransmission code to
+ * : avoid extra retransmission.
*/

#include <linux/config.h>
@@ -404,6 +407,7 @@
skb_queue_head_init(&newsk->receive_queue);
newsk->send_head = NULL;
newsk->send_tail = NULL;
+ newsk->send_next = NULL;
skb_queue_head_init(&newsk->back_log);
newsk->rtt = 0;
newsk->rto = TCP_TIMEOUT_INIT;
@@ -562,6 +566,7 @@
skb2 = sk->send_head;
sk->send_head = NULL;
sk->send_tail = NULL;
+ sk->send_next = NULL;

/*
* This is an artifact of a flawed concept. We want one
@@ -595,6 +600,7 @@
{
sk->send_head = skb;
sk->send_tail = skb;
+ sk->send_next = skb;
}
else
{
@@ -685,6 +691,7 @@
{
sk->send_head = NULL;
sk->send_tail = NULL;
+ sk->send_next = NULL;
sk->packets_out= 0;
}

@@ -745,8 +752,8 @@
* The packet acked data after high_seq;
* I've tried to order these in occurrence of most likely to fail
* to least likely to fail.
- * [These are the rules BSD stacks use to determine if an ACK is a
- * duplicate.]
+ * [These are an extension of the rules BSD stacks use to
+ * determine if an ACK is a duplicate.]
*/

if (sk->rcv_ack_seq == ack
@@ -755,22 +762,23 @@
&& before(ack, sk->sent_seq)
&& after(ack, sk->high_seq))
{
+ /* Prevent counting of duplicate ACKs if the congestion
+ * window is smaller than 3. Note that since we reduce
+ * the congestion window when we do a fast retransmit,
+ * we must be careful to keep counting if we were already
+ * counting. The idea behind this is to avoid doing
+ * fast retransmits if the congestion window is so small
+ * that we cannot get 3 ACKs due to the loss of a packet
+ * unless we are getting ACKs for retransmitted packets.
+ */
+ if (sk->cong_window >= 3 || sk->rcv_ack_cnt > MAX_DUP_ACKS+1)
+ sk->rcv_ack_cnt++;
/* See draft-stevens-tcpca-spec-01 for explanation
* of what we are doing here.
*/
- sk->rcv_ack_cnt++;
if (sk->rcv_ack_cnt == MAX_DUP_ACKS+1) {
sk->ssthresh = max(sk->cong_window >> 1, 2);
sk->cong_window = sk->ssthresh+MAX_DUP_ACKS+1;
- /* FIXME:
- * reduce the count. We don't want to be
- * seen to be in "retransmit" mode if we
- * are doing a fast retransmit.
- * This is also a signal to tcp_do_retransmit
- * not to set sk->high_seq.
- * This is a horrible ugly hack.
- */
- sk->retransmits--;
tcp_do_retransmit(sk,0);
} else if (sk->rcv_ack_cnt > MAX_DUP_ACKS+1) {
sk->cong_window++;
@@ -878,6 +886,13 @@
sk->send_tail = NULL;
sk->retransmits = 0;
}
+
+ /*
+ * advance the send_next pointer if needed.
+ */
+ if (sk->send_next == skb)
+ sk->send_next = sk->send_head;
+
/*
* Note that we only reset backoff and rto in the
* rtt recomputation code. And that doesn't happen
@@ -916,86 +931,93 @@
}

/*
- * XXX someone ought to look at this too.. at the moment, if skb_peek()
- * returns non-NULL, we complete ignore the timer stuff in the else
- * clause. We ought to organize the code so that else clause can
- * (should) be executed regardless, possibly moving the PROBE timer
- * reset over. The skb_peek() thing should only move stuff to the
- * write queue, NOT also manage the timer functions.
- */
-
- /*
* Maybe we can take some stuff off of the write queue,
* and put it onto the xmit queue.
+ * FIXME: (?) There is a bizarre case being tested here, to check if
+ * the data at the head of the queue ends before the start of
+ * the sequence we already ACKed. This does not appear to be
+ * a case that can actually occur. Why are we testing it?
*/
- if (skb_peek(&sk->write_queue) != NULL)
- {
- if (!before(sk->window_seq, sk->write_queue.next->end_seq) &&
- (sk->retransmits == 0 ||
- sk->ip_xmit_timeout != TIME_WRITE ||
- !after(sk->write_queue.next->end_seq, sk->rcv_ack_seq))
- && sk->packets_out < sk->cong_window)
- {
- /*
- * Add more data to the send queue.
- */
- flag |= 1;
- tcp_write_xmit(sk);
- }
- else if (before(sk->window_seq, sk->write_queue.next->end_seq) &&
- sk->send_head == NULL &&
- sk->ack_backlog == 0 &&
- sk->state != TCP_TIME_WAIT)
- {
- /*
- * Data to queue but no room.
- */
- tcp_reset_xmit_timer(sk, TIME_PROBE0, sk->rto);
- }
- }
- else
+
+ if (!skb_queue_empty(&sk->write_queue) &&
+ !before(sk->window_seq, sk->write_queue.next->end_seq) &&
+ (sk->retransmits == 0 ||
+ sk->ip_xmit_timeout != TIME_WRITE ||
+ !after(sk->write_queue.next->end_seq, sk->rcv_ack_seq)) &&
+ sk->packets_out < sk->cong_window)
{
/*
- * from TIME_WAIT we stay in TIME_WAIT as long as we rx packets
- * from TCP_CLOSE we don't do anything
- *
- * from anything else, if there is write data (or fin) pending,
- * we use a TIME_WRITE timeout, else if keepalive we reset to
- * a KEEPALIVE timeout, else we delete the timer.
- *
- * We do not set flag for nominal write data, otherwise we may
- * force a state where we start to write itsy bitsy tidbits
- * of data.
+ * Add more data to the send queue.
*/
+ flag |= 1;
+ tcp_write_xmit(sk);
+ }

- switch(sk->state) {
- case TCP_TIME_WAIT:
- /*
- * keep us in TIME_WAIT until we stop getting packets,
- * reset the timeout.
- */
- tcp_reset_msl_timer(sk, TIME_CLOSE, TCP_TIMEWAIT_LEN);
- break;
- case TCP_CLOSE:
- /*
- * don't touch the timer.
- */
- break;
- default:
- /*
- * Must check send_head and write_queue
- * to determine which timeout to use.
- */
- if (sk->send_head || !skb_queue_empty(&sk->write_queue)) {
- tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
- } else if (sk->keepopen) {
- tcp_reset_xmit_timer(sk, TIME_KEEPOPEN, TCP_TIMEOUT_LEN);
- } else {
- del_timer(&sk->retransmit_timer);
- sk->ip_xmit_timeout = 0;
+ /*
+ * Reset timers to reflect the new state.
+ *
+ * from TIME_WAIT we stay in TIME_WAIT as long as we rx packets
+ * from TCP_CLOSE we don't do anything
+ *
+ * from anything else, if there is queued data (or fin) pending,
+ * we use a TIME_WRITE timeout, if there is data to write but
+ * no room in the window we use TIME_PROBE0, else if keepalive
+ * we reset to a KEEPALIVE timeout, else we delete the timer.
+ *
+ * We do not set flag for nominal write data, otherwise we may
+ * force a state where we start to write itsy bitsy tidbits
+ * of data.
+ */
+
+ switch(sk->state) {
+ case TCP_TIME_WAIT:
+ /*
+ * keep us in TIME_WAIT until we stop getting packets,
+ * reset the timeout.
+ */
+ tcp_reset_msl_timer(sk, TIME_CLOSE, TCP_TIMEWAIT_LEN);
+ break;
+ case TCP_CLOSE:
+ /*
+ * don't touch the timer.
+ */
+ break;
+ default:
+ /*
+ * Must check send_head and write_queue
+ * to determine which timeout to use.
+ */
+ if (sk->send_head) {
+ tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
+ } else if (!skb_queue_empty(&sk->write_queue)) {
+ /*
+ * if the write queue is not empty when we get here
+ * then we failed to move any data to the retransmit
+ * queue above. (If we had send_head would be non-NULL).
+ * Furthermore, since the send_head is NULL here
+ * we must not be in retransmit mode at this point.
+ * This implies we have no packets in flight,
+ * hence sk->packets_out < sk->cong_window.
+ * Examining the conditions for the test to move
+ * data to the retransmission queue we find that
+ * we must therefore have a zero window.
+ * Hence, if the ack_backlog is 0 we should initiate
+ * a zero probe.
+ */
+ if (sk->ack_backlog == 0) {
+ /*
+ * Data to queue but no room.
+ * Set up a zero window probe timeout.
+ */
+ tcp_reset_xmit_timer(sk, TIME_PROBE0, sk->rto);
}
- break;
+ } else if (sk->keepopen) {
+ tcp_reset_xmit_timer(sk, TIME_KEEPOPEN, TCP_TIMEOUT_LEN);
+ } else {
+ del_timer(&sk->retransmit_timer);
+ sk->ip_xmit_timeout = 0;
}
+ break;
}

/*
@@ -1094,45 +1116,18 @@
}

/*
- * I make no guarantees about the first clause in the following
- * test, i.e. "(!flag) || (flag&4)". I'm not entirely sure under
- * what conditions "!flag" would be true. However I think the rest
- * of the conditions would prevent that from causing any
- * unnecessary retransmission.
- * Clearly if the first packet has expired it should be
- * retransmitted. The other alternative, "flag&2 && retransmits", is
- * harder to explain: You have to look carefully at how and when the
- * timer is set and with what timeout. The most recent transmission always
- * sets the timer. So in general if the most recent thing has timed
- * out, everything before it has as well. So we want to go ahead and
- * retransmit some more. If we didn't explicitly test for this
- * condition with "flag&2 && retransmits", chances are "when + rto < jiffies"
- * would not be true. If you look at the pattern of timing, you can
- * show that rto is increased fast enough that the next packet would
- * almost never be retransmitted immediately. Then you'd end up
- * waiting for a timeout to send each packet on the retransmission
- * queue. With my implementation of the Karn sampling algorithm,
- * the timeout would double each time. The net result is that it would
- * take a hideous amount of time to recover from a single dropped packet.
- * It's possible that there should also be a test for TIME_WRITE, but
- * I think as long as "send_head != NULL" and "retransmit" is on, we've
- * got to be in real retransmission mode.
- * Note that tcp_do_retransmit is called with all==1. Setting cong_window
- * back to 1 at the timeout will cause us to send 1, then 2, etc. packets.
- * As long as no further losses occur, this seems reasonable.
+ * The following code has been greatly simplified from the
+ * old hacked up stuff. The wonders of properly setting the
+ * retransmission timeouts.
+ *
+ * If we are retransmitting, and we acked a packet on the retransmit
+ * queue, and there is still something in the retransmit queue,
+ * then we can output some retransmission packets.
*/
-
- if (((!flag) || (flag&4)) && sk->send_head != NULL &&
- (((flag&2) && sk->retransmits) ||
- (sk->send_head->when + sk->rto < jiffies)))
- {
- if(sk->send_head->when + sk->rto < jiffies)
- tcp_retransmit(sk,0);
- else
- {
- tcp_do_retransmit(sk, 1);
- tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
- }
+
+ if (sk->send_head != NULL && (flag&2) && sk->retransmits)
+ {
+ tcp_do_retransmit(sk, 1);
}

return 1;
@@ -1230,8 +1225,12 @@
* for handling this timeout.
*/

- if(sk->ip_xmit_timeout != TIME_WRITE)
- tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
+ if (sk->ip_xmit_timeout != TIME_WRITE) {
+ if (sk->send_head)
+ tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
+ else
+ printk(KERN_ERR "send_head NULL in FIN_WAIT1\n");
+ }
tcp_set_state(sk,TCP_CLOSING);
break;
case TCP_FIN_WAIT2:
@@ -1965,7 +1964,7 @@
* Note most of these are inline now. I'll inline the lot when
* I have time to test it hard and look at what gcc outputs
*/
-
+
if (!tcp_sequence(sk, skb->seq, skb->end_seq-th->syn))
{
bad_tcp_sequence(sk, th, skb->end_seq-th->syn, dev);
diff -r -u linux-1.99.8/net/ipv4/tcp_output.c linux/net/ipv4/tcp_output.c
--- linux-1.99.8/net/ipv4/tcp_output.c Thu May 23 10:08:55 1996
+++ linux/net/ipv4/tcp_output.c Thu May 30 01:15:20 1996
@@ -18,6 +18,9 @@
* Matthew Dillon, <dillon@apollo.west.oic.com>
* Arnt Gulbrandsen, <agulbra@nvg.unit.no>
* Jorge Cwik, <jorge@laser.satlink.net>
+ *
+ * Fixes: Eric Schenk : avoid multiple retransmissions in one
+ * : round trip timeout.
*/

#include <linux/config.h>
@@ -175,7 +178,7 @@
if (before(sk->window_seq, sk->write_queue.next->end_seq) &&
sk->send_head == NULL && sk->ack_backlog == 0)
tcp_reset_xmit_timer(sk, TIME_PROBE0, sk->rto);
- }
+ }
else
{
/*
@@ -198,9 +201,9 @@
sk->prot->queue_xmit(sk, skb->dev, skb, 0);

/*
- * Set for next retransmit based on expected ACK time.
- * FIXME: We set this every time which means our
- * retransmits are really about a window behind.
+ * Set for next retransmit based on expected ACK time
+ * of the first packet in the resend queue.
+ * This is no longer a window behind.
*/

tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
@@ -364,10 +367,6 @@

clear_delayed_acks(sk);

- /*
- * Again we slide the timer wrongly
- */
-
tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
}
}
@@ -384,10 +383,17 @@
struct sk_buff * skb;
struct proto *prot;
struct device *dev;
- int ct=0;
struct rtable *rt;

prot = sk->prot;
+ if (!all) {
+ /*
+ * If we are just retransmitting one packet reset
+ * to the start of the queue.
+ */
+ sk->send_next = sk->send_head;
+ sk->packets_out = 0;
+ }
skb = sk->send_head;

while (skb != NULL)
@@ -399,7 +405,7 @@
dev = skb->dev;
IS_SKB(skb);
skb->when = jiffies;
-
+
/* dl1bke 960201 - @%$$! Hope this cures strange race conditions */
/* with AX.25 mode VC. (esp. DAMA) */
/* if the buffer is locked we should not retransmit */
@@ -523,17 +529,15 @@
/* Now queue it */
ip_statistics.IpOutRequests++;
dev_queue_xmit(skb, dev, sk->priority);
+ sk->packets_out++;
}
}
}
-

/*
* Count retransmissions
*/

- ct++;
- sk->retransmits++;
sk->prot->retransmits++;
tcp_statistics.TcpRetransSegs++;

@@ -544,6 +548,11 @@
if (sk->retransmits)
sk->high_seq = sk->sent_seq;

+ /*
+ * Advance the send_next pointer so we don't keep
+ * retransmitting the same stuff every time we get an ACK.
+ */
+ sk->send_next = skb->link3;

/*
* Only one retransmit requested.
@@ -556,8 +565,9 @@
* This should cut it off before we send too many packets.
*/

- if (ct >= sk->cong_window)
+ if (sk->packets_out >= sk->cong_window)
break;
+
skb = skb->link3;
}
}
@@ -888,10 +898,10 @@
&& skb_queue_empty(&sk->write_queue)
&& sk->ip_xmit_timeout == TIME_WRITE)
{
- if(sk->keepopen)
+ if (sk->keepopen)
tcp_reset_xmit_timer(sk,TIME_KEEPOPEN,TCP_TIMEOUT_LEN);
else
- delete_timer(sk);
+ del_timer(&sk->retransmit_timer);
}

/*
@@ -946,7 +956,7 @@

tcp_send_check(t1, sk->saddr, sk->daddr, sizeof(*t1), buff);
if (sk->debug)
- printk("\rtcp_ack: seq %x ack %x\n", sk->sent_seq, sk->acked_seq);
+ printk(KERN_ERR "\rtcp_ack: seq %x ack %x\n", sk->sent_seq, sk->acked_seq);
sk->prot->queue_xmit(sk, dev, buff, 1);
tcp_statistics.TcpOutSegs++;
}
diff -r -u linux-1.99.8/net/ipv4/tcp_timer.c linux/net/ipv4/tcp_timer.c
--- linux-1.99.8/net/ipv4/tcp_timer.c Thu May 23 10:05:56 1996
+++ linux/net/ipv4/tcp_timer.c Thu May 30 00:27:09 1996
@@ -18,6 +18,10 @@
* Matthew Dillon, <dillon@apollo.west.oic.com>
* Arnt Gulbrandsen, <agulbra@nvg.unit.no>
* Jorge Cwik, <jorge@laser.satlink.net>
+ *
+ * Fixes:
+ *
+ * Eric Schenk : Fix retransmission timeout counting.
*/

#include <net/tcp.h>
@@ -35,12 +39,33 @@
{
del_timer(&sk->retransmit_timer);
sk->ip_xmit_timeout = why;
- if((long)when < 0)
- {
- when=3;
- printk(KERN_ERR "Error: Negative timer in xmit_timer\n");
+ if (why == TIME_WRITE) {
+ /* In this case we want to timeout on the first packet
+ * in the resend queue. If the resend queue is empty,
+ * then the packet we are sending hasn't made it there yet,
+ * so we timeout from the current time.
+ */
+ if (sk->send_head) {
+ sk->retransmit_timer.expires =
+ sk->send_head->when + when;
+ } else {
+ /* This should never happen!
+ */
+ printk(KERN_ERR "Error: send_head NULL in xmit_timer\n");
+ sk->ip_xmit_timeout = 0;
+ return;
+ }
+ } else {
+ sk->retransmit_timer.expires = jiffies+when;
+ }
+
+ if (sk->retransmit_timer.expires < jiffies) {
+ /* We can get here if we reset the timer on an event
+ * that could not fire because the interrupts were disabled.
+ * make sure it happens soon.
+ */
+ sk->retransmit_timer.expires = jiffies+2;
}
- sk->retransmit_timer.expires=jiffies+when;
add_timer(&sk->retransmit_timer);
}

@@ -56,6 +81,15 @@

static void tcp_retransmit_time(struct sock *sk, int all)
{
+ /*
+ * record how many times we've timed out.
+ * This determines when we should quit trying.
+ * This needs to be counted here, because we should not be
+ * counting one per packet we send, but rather one per round
+ * trip timeout.
+ */
+ sk->retransmits++;
+
tcp_do_retransmit(sk, all);

/*
@@ -77,7 +111,12 @@

sk->backoff++;
sk->rto = min(sk->rto << 1, 120*HZ);
- tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
+
+ /* be paranoid about the data structure... */
+ if (sk->send_head)
+ tcp_reset_xmit_timer(sk, TIME_WRITE, sk->rto);
+ else
+ printk(KERN_ERR "send_head NULL in tcp_retransmit_time\n");
}

/*
@@ -101,7 +140,6 @@
sk->ssthresh = sk->cong_window >> 1; /* remember window where we lost */
/* sk->ssthresh in theory can be zero. I guess that's OK */
sk->cong_count = 0;
-
sk->cong_window = 1;

/* Do the actual retransmit. */