Re: [patch] TCP/IP delacks disabled with MPI

Josip Loncaric (josip@icase.edu)
Mon, 24 May 1999 22:13:50 -0400


Andi Kleen wrote:
>
> Josip Loncaric <josip@icase.edu> writes:
>
> > My information is second hand, but as far as I can tell, Linux TCP
> > handles congestion window according to the _amount_ of acknowledged data
> > regardless of the _number_ of packets these ACKs represent. This breaks
> > the slow start algorithm, which is supposed to exponentially open the
> > congestion window. Linux TCP opens it linearly when packets are small
> > (much less than MSS).
>
> I think you're on the wrong track by looking at the sending end. Linux increases
> the cwnd by one for every received ack that acks new data. I don't know of any
> TCP/IP stacks that do anything more fancy here, and as I pointed out in my
> other mail your simple "should count the number of packets" is an intolerable
> burden to the implementation (and also would violate RFC2001).

My reading of rfc2001 is that during slow start cwnd should open
approximately exponentially (1->2->4->8->...), not linearly
(1->2->3->4->...). I agree that counting ACKs (instead of ACKed
packets) is an understandable simplification, even though I'm told that
Jacobson's original algorithm counted ACKed packets. The RFC
acknowledges that counting ACKs does not open cwnd exactly
exponentially, because of delayed ACKs, but it states that
typically one ACK will be received for every two segments sent.
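
The difference between the two growth patterns can be sketched with a toy model. This is only an illustration under stated assumptions, not kernel code: cwnd is counted in segments, each round trip sends a full window, every ACK opens cwnd by one, and the receiver emits one ACK per segs_per_ack segments (with a final ACK for any remainder). The function name and parameters are hypothetical.

```c
#include <assert.h>

/*
 * Toy slow-start model (an assumption-laden sketch, not the Linux code).
 * Starts at cwnd = 1 segment and returns cwnd after `rounds` round trips.
 * `segs_per_ack` is how many segments the receiver covers with one ACK.
 */
unsigned slow_start_cwnd(unsigned rounds, unsigned segs_per_ack)
{
    unsigned cwnd = 1;
    unsigned r;

    assert(segs_per_ack > 0);
    for (r = 0; r < rounds; r++) {
        /* ACKs returned for one full window, rounding up so that a
         * leftover segment is still acknowledged eventually. */
        unsigned acks = (cwnd + segs_per_ack - 1) / segs_per_ack;

        cwnd += acks;   /* one segment of growth per ACK received */
    }
    return cwnd;
}
```

In this model one ACK per segment doubles cwnd every round trip, one ACK per two segments still grows it geometrically (by roughly 1.5x), and a receiver that acknowledges many segments at once yields one ACK per round trip, so cwnd grows by one each time.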

This may be true for large packets, but it takes a LOT of tiny packets
to make two segments. For example, our MSS=1500 and the length of a TCP
packet carrying 1 byte is 41 bytes, so 73 such packets (2993 bytes) can
be sent and *still* not reach the 2*MSS limit which would force an ACK.
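
The arithmetic can be checked with a short helper. It takes the figures in this message at face value (41 bytes per tiny packet, counted against a 2*MSS threshold of 3000 bytes); the function name is hypothetical.

```c
#include <assert.h>

/*
 * Largest number of packets of `pkt_len` bytes whose total stays
 * strictly below `limit` bytes. A sketch of the arithmetic only.
 */
unsigned packets_below_limit(unsigned pkt_len, unsigned limit)
{
    assert(pkt_len > 0 && limit > 0);
    return (limit - 1) / pkt_len;
}
```

Calling packets_below_limit(41, 2 * 1500) gives 73, since 73*41 = 2993 is below 3000 while 74*41 = 3034 is not.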

> I don't know what 30s timer you're referring to (you don't mean the probe timer,
> do you?)

No. This timer appears in our kernel 2.0.36 but seems to be gone by
2.2.2. Check tcp_output.c (kernel 2.0.36) and look for line 297,
which reads

sk->partial_timer.expires = jiffies+30*HZ;

(This line has puzzled other people in the past, which may explain why
it is gone in 2.2.)

> I will try to get sysctls to turn off delayed ack and a reasonable way to
> increase initial cwnd into 2.2/2.3 (reasonable for cwnd should at least
> include a recompile, because otherwise it is too easy for users to cause too
> much harm)

While the *initial* cwnd=3 would help a bit, please note that rfc2414
allows this only if there were *no* retransmits during the initial
"three-way handshake". Moreover, the initial cwnd is quickly adjusted
upwards, so a larger initial value helps only once.

A more significant benefit of rfc2414 is that a large cwnd, once
established, is not dropped all the way to 1 the moment the socket has
been idle for an RTO period. Such idle periods can occur very often
during a session, and without this change each one forces a slow start
from 1. Instead, rfc2414 suggests reducing the cwnd*MSS window to a more
reasonable value of min(4*MSS,max(2*MSS,4380)) bytes (which works out to
cwnd=3 in our case).
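
As a worked example of that formula, here is a minimal sketch. The function names are hypothetical, and converting bytes to segments by rounding to the nearest whole segment is an assumption made here to match the "cwnd=3" figure; it is not taken from the RFC text.

```c
#include <assert.h>

/* Window in bytes from the formula above: min(4*MSS, max(2*MSS, 4380)). */
unsigned rfc2414_window_bytes(unsigned mss)
{
    unsigned lower = (2 * mss > 4380) ? 2 * mss : 4380;
    unsigned upper = 4 * mss;

    return (upper < lower) ? upper : lower;
}

/* Same window in segments; rounding to nearest is an assumption. */
unsigned rfc2414_window_segments(unsigned mss)
{
    assert(mss > 0);
    return (rfc2414_window_bytes(mss) + mss / 2) / mss;
}
```

With MSS=1500 the window is 4380 bytes, about 2.9 segments, hence cwnd=3. A small MSS such as 536 is capped by the 4*MSS term (2144 bytes, 4 segments), and a large MSS such as 4000 is governed by the 2*MSS term (8000 bytes, 2 segments).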

Actually, we do the following in tcp_output.c (kernel 2.0.36) after
being idle for longer than sk->rto:

if (sk->cong_window > 3) sk->cong_window = max(3, sk->cong_window >> 1);

This works just fine. For example, cwnd=2048 is reduced to 3 in only 10
steps, which fits well with our 10 times shorter RTO floor. Retransmits
could have forced cwnd=1, so we do not touch cwnd<3 after an idle
period. Of course, "3" above should be replaced by the value computed
for the particular socket's MSS.
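
The step count can be checked with a standalone sketch of the same rule. It mirrors the one-line change quoted above but works on plain integers rather than on struct sock; the function names and the floor parameter (standing in for the hard-coded 3) are hypothetical.

```c
#include <assert.h>

/* One application of the idle rule: halve cwnd, but never below `floor`,
 * and leave any cwnd already at or below `floor` untouched. */
unsigned idle_decay_once(unsigned cwnd, unsigned floor)
{
    assert(floor >= 1);
    if (cwnd > floor) {
        unsigned half = cwnd >> 1;

        cwnd = (half > floor) ? half : floor;
    }
    return cwnd;
}

/* Number of applications needed before cwnd stops changing. */
unsigned idle_decay_steps(unsigned cwnd, unsigned floor)
{
    unsigned steps = 0;

    while (cwnd > floor) {
        cwnd = idle_decay_once(cwnd, floor);
        steps++;
    }
    return steps;
}
```

Starting from cwnd=2048 with a floor of 3, the sequence is 1024, 512, 256, 128, 64, 32, 16, 8, 4, 3, which is 10 steps. A cwnd of 1 or 2 is returned unchanged.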

Sincerely,
Josip

-- 
Dr. Josip Loncaric, Senior Staff Scientist        mailto:josip@icase.edu
ICASE, Mail Stop 132C                       http://www.icase.edu/~josip/
NASA Langley Research Center             mailto:j.loncaric@larc.nasa.gov
Hampton, VA 23681-2199, USA    Tel. +1 757 864-2192  Fax +1 757 864-6134
