Re: [patch] TCP/IP delacks disabled with MPI

Andi Kleen (ak@muc.de)
23 May 1999 01:28:24 +0200

Messages sorted by: [ date ][ thread ][ subject ][ author ]
Next message: Richard B. Johnson: "Re: Fast TCP/IP checksum patch."
Previous message: Linus Torvalds: "Re: [PATCH] Improving send_sigio() scalability"

Josip Loncaric <josip@icase.edu> writes:

> Andrea Arcangeli wrote:
> >
> > On Fri, 21 May 1999, Josip Loncaric wrote:
> >
> > >My understanding is that Linux TCP incorrectly handles the slow start
> > >algorithm because it does not count the _number_ of packets that each
> > >ACK represents. [..]
>
> My information is second hand, but as far as I can tell, Linux TCP
> handles congestion window according to the _amount_ of acknowledged data
> regardless of the _number_ of packets these ACKs represent. This breaks
> the slow start algorithm, which is supposed to exponentially open the
> congestion window. Linux TCP opens it linearly when packets are small
> (much less than MSS).

I think you're on the wrong track by looking at the sending end. Linux increases
the cwnd by one for every received ack that acks new data. I don't know of any
TCP/IP stacks that do anything more fancy here, and as a pointed out in my
other mail your simple "should count the number of packets" is an untolerable
burden to the implementation (and also would violate RFC2001).

Slow start is such a critical thing to avoid internet meltdown, and with Linux
becoming more and more popular we really cannot risk any experiments in release
kernels here.

The problem is at the sender: it doesn't send enough acks. Linux 2.2
has a heuristic called quick-ack mode, where it acks the first packet immediately
when it thinks the other end is in slow start (this helps a lot).

For your specific application I think something like Andrea's no-delayed-acks
patch is the best, together with an increased initial cwnd. Overloading NO_DELAY
is not so hot a idea I think, because a lot of programs set it that could still
benefit from delayed acks. But I think a sysctl or perhaps a new socket option
would be ok.

It would be interesting to know how the 2.2 stack does on your workload (not
that I suggest that you should switch to it, but perhaps you could run a few
benchmarks with your workload)

> Some time constants in Linux TCP are way too long. Linux TCP estimates
> of RTO are kept 200ms or longer, although for us 20ms is more
> appropriate. This 200ms minimum implies huge windows for Fast Ethernet
> connections (20Mbit or 2.5Mbytes) which we cannot tolerate. Even 20ms
> min. RTO (->250kB window) is too much, particularly since the standard
> TCP socket buffer size is only 32kB. We'd prefer 2ms min. RTO in some
> cases.

This is a valid criticism. The 200ms lower clamp was originally introduced in 2.0
to avoid same bad interactions with BSD based stacks, which use a 200ms fixed
slow delack timer (again mainly on slow modem connections). If it could be
shown that dropping it doesn't harm 28.8 modem performance I would see no
reason to remove it.

> Delayed ACK timeouts of 500ms are standard but about 1000 times longer
> than what we'd like. The 1 second retransmit timer which applies when
> sk->users is set needs to be at least 10-100 times faster. Also, there
> is another 30 second partial packet timer in TCP, which seems completely
> out of line with our needs.

The sk->users thing has been fixed in 2.2 (ATM HZ/20, maybe still a bit too
long). It is definitely a bad bug. Perhaps that should be backported to 2.0.

I don't know what 30s timer you're refering to (you don't mean the probe timer,
do you?)

> My main point is this: we need highly interactive request-response
> communication between processes with sub-millisecond response times.
> The hope of piggy-backing ACKs on the return traffic is usually
> misplaced, although sometimes it would help reduce the workload.

I will try to get sysctls to turn off delayed ack and a reasonable way to
increase initial cwnd into 2.2/2.3 (reasonable for cwnd should at least
include a recompile, because otherwise it is too easy for users to cause too
much harm)

-Andi

-- This is like TV. I don't like TV.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/

Next message: Richard B. Johnson: "Re: Fast TCP/IP checksum patch."
Previous message: Linus Torvalds: "Re: [PATCH] Improving send_sigio() scalability"