Re: [patch] TCP/IP delacks disabled with MPI

Andrea Arcangeli (andrea@suse.de)
Sun, 23 May 1999 19:23:09 +0200 (CEST)


On Fri, 21 May 1999, Andrea Arcangeli wrote:

> A TCP SHOULD implement a delayed ACK, but an ACK should not
> be excessively delayed; in particular, the delay MUST be
> less than 0.5 seconds, and in a stream of full-sized
> segments there SHOULD be an ACK for at least every second
> segment.
>
>What does it mean "at least every second segment"?
>
>2.2.9 interpret this as "at least after 2*MSS of not-acked data is just
>queued in the receiver"....
>
>If 2.2.9 is wrong about that, this is my fix (that hopefully will address
>automagically also the cwnd increase in the MPI case):
[..]
>RCS file: /var/cvs/linux/net/ipv4/tcp_input.c,v
>retrieving revision 1.1.1.11
>diff -u -r1.1.1.11 tcp_input.c
>--- linux/net/ipv4/tcp_input.c 1999/05/16 20:56:05 1.1.1.11
>+++ linux/net/ipv4/tcp_input.c 1999/05/21 15:11:12
[..]
> /* Two full frames received or... */
>- if (((tp->rcv_nxt - tp->rcv_wup) >= tp->rcv_mss * MAX_DELAY_ACK) ||
>+ if (tcp_nr_packets_not_acked(2, sk, tp) ||
^ for the record: this `2' should be
replaced with MAX_DELAY_ACK to be
more elegant :-)

I think my RFC-compliant patch probably makes a real difference only
_without_ the Nagle algorithm (the TCP_NODELAY case), because otherwise TCP
tries to coalesce tinygrams. But consider a sender (maybe another OS) that
for some reason doesn't coalesce its writes and issues 10 writes of
1 byte each:

-- three way handshake --
sender -> receiver send 1
receiver -> sender ack 1 (with this ack the sender increases cwnd
to 2, and the receiver starts to delay
its further acks)
sender -> receiver send 2
(the sender can still send out 1 frame because cwnd is 2)
sender -> receiver send 3

As I see it, _without_ my patch applied (so breaking RFC1122, as Linux
seems to do), at this point the sender blocks waiting for the ack
(cwnd == 2). But the receiver waits too, until the ato times out.
(Btw, I was completely wrong in saying that the ato has a lower bound of
200msec; it was too late when I wrote that... The ato is bounded _above_
by the rto and its lower bound is 1/HZ, apologies Dave.)
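
To make the stall concrete, here is a toy model of the exchange above. It is
only a sketch under loud assumptions: the network has zero delay, the first
segment is acked at once (as in the scenario), cwnd grows by one per ack, and
the only other way to get an ack is to wait for the ato. All names except
MAX_DELAY_ACK are made up for illustration, and none of this is kernel code.
It counts how many times the delayed-ack timer has to fire under the
byte-based test and under the segment-count test.

```c
#include <stdbool.h>

#define MAX_DELAY_ACK 2   /* same name as the kernel define; value assumed */

enum ack_policy { ACK_BY_BYTES, ACK_BY_SEGMENTS };

/* Toy receiver decision: ack the pending data right now? */
static bool ack_now(enum ack_policy policy, unsigned pending_segs,
                    unsigned pending_bytes, unsigned mss)
{
    if (policy == ACK_BY_BYTES)
        return pending_bytes >= mss * MAX_DELAY_ACK;
    return pending_segs >= MAX_DELAY_ACK;
}

/*
 * Count how often the delayed-ack timer must expire before `total`
 * segments of `seg_len` bytes each have all been acked.
 */
unsigned count_ato_expiries(enum ack_policy policy, unsigned total,
                            unsigned seg_len, unsigned mss)
{
    unsigned cwnd = 1, sent = 0, acked = 0;
    unsigned pending_segs = 0, pending_bytes = 0, expiries = 0;

    while (acked < total) {
        bool ack;

        if (sent < total && sent - acked < cwnd) {
            /* cwnd still allows one more segment on the wire */
            sent++;
            pending_segs++;
            pending_bytes += seg_len;
            /* assumption: the very first segment is acked immediately */
            ack = (sent == 1) ||
                  ack_now(policy, pending_segs, pending_bytes, mss);
        } else {
            /* sender is blocked (or done): only the ato can help */
            expiries++;
            ack = true;
        }

        if (ack) {
            acked = sent;
            cwnd++;               /* slow start: +1 per ack received */
            pending_segs = 0;
            pending_bytes = 0;
        }
    }
    return expiries;
}
```

In this model, 10 one-byte segments with an (arbitrary) MSS of 1460 need
three timer expiries under the byte-based test and one under the
segment-count test, the single one being for the odd last segment. A real
sender can behave differently, so treat the numbers as an illustration only.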

Here is a little trace that shows the real-world behaviour. Here the
receiver is running _without_ my patch above applied:

18:55:10.815014 localhost.1025 > localhost.3333: S 4145295248:4145295248(0) win 31072 <mss 3884,sackOK,timestamp 6977 0,nop,wscale 0> (DF)
18:55:10.815081 localhost.3333 > localhost.1025: S 4141493600:4141493600(0) ack 4145295249 win 31072 <mss 3884,sackOK,timestamp 6977 6977,nop,wscale 0> (DF)
18:55:10.815116 localhost.1025 > localhost.3333: . ack 1 win 31072 <nop,nop,timestamp 6977 6977> (DF)
18:55:10.816343 localhost.1025 > localhost.3333: P 1:2(1) ack 1 win 31072 <nop,nop,timestamp 6978 6977> (DF)
18:55:10.816383 localhost.3333 > localhost.1025: . ack 2 win 31071 <nop,nop,timestamp 6978 6978> (DF)
^ 40usec RTT
18:55:10.817107 localhost.1025 > localhost.3333: P 2:3(1) ack 1 win 31072 <nop,nop,timestamp 6978 6978> (DF)
18:55:10.817168 localhost.1025 > localhost.3333: P 3:4(1) ack 1 win 31072 <nop,nop,timestamp 6978 6978> (DF)
18:55:10.817201 localhost.1025 > localhost.3333: P 4:5(1) ack 1 win 31072 <nop,nop,timestamp 6978 6978> (DF)
18:55:10.825904 localhost.3333 > localhost.1025: . ack 5 win 31072 <nop,nop,timestamp 6979 6978> (DF)
^^^^^ ato expired, we wasted ~10msec (first bh_timer)
Btw, the sender here had cwnd == 2 but it sent out three segments. This
sounds like a Linux-sender bug, but it's not relevant to the patch I
am discussing here. Comments about this?

18:55:10.825968 localhost.1025 > localhost.3333: FP 5:11(6) ack 1 win 31072 <nop,nop,timestamp 6979 6979> (DF)
^^^^ due to the large
delay the sender
had time to
pack the data even
with TCP_NODELAY
set. Fine.
18:55:10.825991 localhost.3333 > localhost.1025: . ack 12 win 31065 <nop,nop,timestamp 6979 6979> (DF)
18:55:10.826489 localhost.3333 > localhost.1025: F 1:1(0) ack 12 win 31072 <nop,nop,timestamp 6979 6979> (DF)

Here instead the receiver is running _with_ my patch applied:

18:49:12.911613 localhost.1155 > localhost.3333: S 3745394879:3745394879(0) win 31072 <mss 3884,sackOK,timestamp 1616487 0,nop,wscale 0> (DF)
18:49:12.911708 localhost.3333 > localhost.1155: S 3746598727:3746598727(0) ack 3745394880 win 31072 <mss 3884,sackOK,timestamp 1616487 1616487,nop,wscale 0> (DF)
18:49:12.911751 localhost.1155 > localhost.3333: . ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.911839 localhost.1155 > localhost.3333: P 1:2(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.911897 localhost.3333 > localhost.1155: . ack 2 win 31071 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.911934 localhost.1155 > localhost.3333: P 2:3(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.911966 localhost.1155 > localhost.3333: P 3:4(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.911984 localhost.3333 > localhost.1155: . ack 4 win 31069 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912025 localhost.1155 > localhost.3333: P 4:5(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912082 localhost.1155 > localhost.3333: P 5:6(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912101 localhost.3333 > localhost.1155: . ack 6 win 31067 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912137 localhost.1155 > localhost.3333: P 6:7(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912164 localhost.1155 > localhost.3333: P 7:8(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912181 localhost.3333 > localhost.1155: . ack 8 win 31065 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912215 localhost.1155 > localhost.3333: P 8:9(1) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912261 localhost.3333 > localhost.1155: . ack 10 win 31063 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912344 localhost.1155 > localhost.3333: F 11:11(0) ack 1 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.912372 localhost.3333 > localhost.1155: . ack 12 win 31061 <nop,nop,timestamp 1616487 1616487> (DF)
18:49:12.917330 localhost.3333 > localhost.1155: F 1:1(0) ack 12 win 31072 <nop,nop,timestamp 1616487 1616487> (DF)

The total time for the operation, from the first to the last packet of each
trace, was about 11.5msec _without_ my patch and about 5.7msec _with_ it.
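
The totals can be cross-checked from the first and last tcpdump timestamps of
each trace. A minimal sketch, assuming every stamp has the form
HH:MM:SS.uuuuuu with exactly six fractional digits and both stamps fall on
the same day; the function names are made up for illustration.

```c
#include <stdio.h>

/*
 * Convert a tcpdump-style "HH:MM:SS.uuuuuu" stamp to microseconds
 * since midnight. Returns -1 if the stamp does not parse.
 * Assumes exactly six digits after the dot.
 */
long long stamp_to_usec(const char *stamp)
{
    int h, m, s;
    long us;

    if (sscanf(stamp, "%d:%d:%d.%6ld", &h, &m, &s, &us) != 4)
        return -1;
    return (((long long)h * 60 + m) * 60 + s) * 1000000LL + us;
}

/* Elapsed microseconds between two valid stamps on the same day. */
long long elapsed_usec(const char *first, const char *last)
{
    return stamp_to_usec(last) - stamp_to_usec(first);
}
```

Fed with 18:55:10.815014 and 18:55:10.826489 this gives 11475 usec for the
unpatched receiver; with 18:49:12.911613 and 18:49:12.917330 it gives
5717 usec for the patched one.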

This may explain why MPI got such a big speedup from killing delayed acks.
Right now I think that killing delayed acks was just a workaround for the
real bug.

I did a new patch that also adds a few lines of credits and uses the
MAX_DELAY_ACK #define. My guess is that with this bit fixed MPI people
won't need to kill delacks anymore to get good performance. It would be
nice to get feedback about this though :-). Thanks.

New patch against 2.2.9 or 2.3.3:

Index: linux/net/ipv4/tcp_input.c
===================================================================
RCS file: /var/cvs/linux/net/ipv4/tcp_input.c,v
retrieving revision 1.1.1.10
diff -u -r1.1.1.10 tcp_input.c
--- linux/net/ipv4/tcp_input.c 1999/05/12 11:37:05 1.1.1.10
+++ linux/net/ipv4/tcp_input.c 1999/05/23 17:11:54
@@ -55,6 +55,12 @@
* work without delayed acks.
* Andi Kleen: Process packets with PSH set in the
* fast path.
+ * Andrea Arcangeli: Force an ack if we just queued
+ * MAX_DELAY_ACK not-yet-acked frames.
+ * This is required by rfc1122. This
+ * will make sure to not cause the sender
+ * to stall in slow start, and will
+ * increase cwnd at the right rate.
*/

#include <linux/config.h>
@@ -1557,6 +1597,25 @@
}
}

+static __inline__ int tcp_nr_packets_not_acked(int nr, struct sock * sk,
+ struct tcp_opt * tp)
+{
+ int __nr = 0;
+ struct sk_buff * skb = (struct sk_buff *) &sk->receive_queue;
+
+ while ((skb = skb->prev) != (struct sk_buff *) &sk->receive_queue)
+ {
+ if (TCP_SKB_CB(skb)->end_seq > tp->last_ack_sent)
+ {
+ if (++__nr >= nr)
+ return 1;
+ }
+ else
+ break;
+ }
+ return 0;
+}
+
/*
* Check if sending an ack is needed.
*/
@@ -1579,13 +1638,13 @@
*/

/* Two full frames received or... */
- if (((tp->rcv_nxt - tp->rcv_wup) >= tp->rcv_mss * MAX_DELAY_ACK) ||
+ if (tcp_nr_packets_not_acked(MAX_DELAY_ACK, sk, tp) ||
/* We will update the window "significantly" or... */
tcp_raise_window(sk) ||
/* We entered "quick ACK" mode or... */
tcp_in_quickack_mode(tp) ||
/* We have out of order data */
- (skb_peek(&tp->out_of_order_queue) != NULL)) {
+ !skb_queue_empty(&tp->out_of_order_queue)) {
/* Then ack it now */
tcp_send_ack(sk);
} else {
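
For anyone who wants to play with the counting logic outside the kernel, here
is a stand-alone userspace sketch of the same backward walk. It is not the
kernel code: struct seg, queue_init(), queue_tail(), seq_after() and
nr_segs_not_acked() are hypothetical stand-ins for the sk_buff queue and for
tcp_nr_packets_not_acked(). One deliberate difference: the patch compares
sequence numbers with a plain `>`, while the sketch uses a signed difference
so that the comparison survives sequence wrap-around. That is a choice of the
sketch, not something the patch does.

```c
#include <stdint.h>

/* Hypothetical stand-in for a queued sk_buff: only what the walk needs. */
struct seg {
    struct seg *next, *prev;
    uint32_t end_seq;
};

/* Wrap-safe "a is after b" for 32-bit sequence numbers. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

/* Circular list with a sentinel head, like the kernel's queue heads. */
void queue_init(struct seg *head)
{
    head->next = head->prev = head;
}

void queue_tail(struct seg *head, struct seg *s)
{
    s->prev = head->prev;
    s->next = head;
    head->prev->next = s;
    head->prev = s;
}

/*
 * Walk from the newest segment backwards and return 1 as soon as `nr`
 * segments ending after last_ack_sent have been seen, else 0.
 * Stops at the first segment already covered by the last ack.
 */
int nr_segs_not_acked(int nr, const struct seg *head, uint32_t last_ack_sent)
{
    const struct seg *s;
    int count = 0;

    for (s = head->prev; s != head; s = s->prev) {
        if (!seq_after(s->end_seq, last_ack_sent))
            break;
        if (++count >= nr)
            return 1;
    }
    return 0;
}
```

The walk starts at the tail because the newest segments are the ones that can
still be unacked, so with nr == 2 it touches at most two or three queue
entries however long the receive queue is.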

Comments?

Andrea Arcangeli

PS. The patch is also available here:
ftp://e-mind.com/pub/andrea/kernel-patches/tcp-2.2.9-MPI-D
