TCP problem in 2.2/2.3 kernels (?)

Simon Kirby (sim@netnation.com)
Thu, 9 Sep 1999 18:10:39 -0400


It has happened to me several times in the past that a connection to my
mail server across a sometimes-packetlossy cablemodem connection appears
to die and not come back at all or only after several minutes after the
packet loss stops and "ping -c 50" shows no loss at all. This has
happened to me several times in the past, and it just happened to me
again (and is still happening in another console), with kernel 2.3.17
across the wire to server kernel 2.2.12 (I initiated the connection on
the 2.3.17 machine). This has happened before between 2.2 kernels of
older versions as well, however.

I was typing an email via SSH which goes over eth0 through a local
crossover ethernet connection to another box which is connected to the
cablemodem and masquerades everything that wants to go out to the
Internet and sends it out the cablemodem. I don't think the masqing is
the problem, though, as I'm seeing what looks to me like a problem with
TCP viewing from tcpdump on the host that isn't doing any masqing.

Basically, here's what happened:

I was typing the email, and suddenly what I was typing stopped appearing.
The connection I had open to the mail server has been up for hours, and
it usually has an average ping of about 80ms. I switched to another
console and did a "ping"...the first few packets were dropped, and then
the connection came back and it was fine for the rest of the pings.

I went back to the mail console and typed a few more characters -- still
nothing. I did a "ping -c 50" in the other console, and when it finished
it dropped 2 of the 50 packets. I went back and still nothing.

I went to the other console, and typed some commands:

[sroot@oof:/root]# netstat -a -n | grep :22
tcp 0 0 192.168.143.2:1020 204.174.223.27:22 ESTABLISHED
tcp 0 0 192.168.143.2:1019 204.174.223.27:22 ESTABLISHED
tcp 0 0 192.168.143.2:1022 204.174.223.11:22 TIME_WAIT
tcp 0 0 192.168.143.2:1021 204.174.223.11:22 ESTABLISHED
tcp 0 80 192.168.143.2:1023 204.174.223.2:22 ESTABLISHED
[sroot@oof:/root]# tcpdump -i eth0 -n port 1023 and port 22
tcpdump uses obsolete (PF_INET,SOCK_PACKET)
device eth0 entered promiscuous mode
tcpdump: listening on eth0
17:39:40.359268 192.168.143.2.1023 > 204.174.223.2.22: P 2085755509:2085755589(80) ack 2073415618 win 31856 <nop,nop,timestamp 2446022 144513894> (DF) [tos 0x10]
17:39:40.468087 204.174.223.2.22 > 192.168.143.2.1023: . ack 80 win 32120 <nop,nop,timestamp 144517762 2446022> (DF) [tos 0x10]
17:39:40.469605 204.174.223.2.22 > 192.168.143.2.1023: P 1:85(84) ack 80 win 32120 <nop,nop,timestamp 144517762 2446022> (DF) [tos 0x10]
17:39:40.489253 192.168.143.2.1023 > 204.174.223.2.22: . ack 85 win 31856 <nop,nop,timestamp 2446035 144517762> (DF) [tos 0x10]

4 packets received by filter
0 packets dropped by kernel
device eth0 left promiscuous mode

<At this point I thought it was unplugged, so I went to the other console
and pressed an arrow key. The cursor had moved from before, but there
was no response to my new keypress.>

[sroot@oof:/root]# netstat -a -n | grep :22 | grep :1023
tcp 0 20 192.168.143.2:1023 204.174.223.2:22 ESTABLISHED

<So the 80 bytes were acked, but my new 20 bytes created by the arrow key
haven't been acked yet. I press another arrow key to see if it's
unplugged: no. I run tcpdump to see if it's getting transmitted...>

[sroot@oof:/root]# tcpdump -i eth0 -n port 1023 and port 22
device eth0 entered promiscuous mode
tcpdump: listening on eth0

17:41:01.885555 204.174.223.2.22 > 192.168.143.2.1023: P 2073415730:2073415758(28) ack 2085755589 win 32120 <nop,nop,timestamp 144525902 2448167> (DF) [tos 0x10]
17:41:01.899275 192.168.143.2.1023 > 204.174.223.2.22: . ack 28 win 31856 <nop,nop,timestamp 2454176 144525902> (DF) [tos 0x10]
17:41:17.739253 192.168.143.2.1023 > 204.174.223.2.22: P 1:41(40) ack 28 win 31856 <nop,nop,timestamp 2455760 144525902> (DF) [tos 0x10]
<It finally got transmitted!>
17:41:17.850252 204.174.223.2.22 > 192.168.143.2.1023: . ack 41 win 32120 nop,nop,timestamp 144527503 2455760> (DF) [tos 0x10]
17:41:17.850727 204.174.223.2.22 > 192.168.143.2.1023: P 28:72(44) ack 41 win 32120 <nop,nop,timestamp 144527503 2455760> (DF) [tos 0x10]
17:41:17.869272 192.168.143.2.1023 > 204.174.223.2.22: . ack 72 win 31856 <nop,nop,timestamp 2455773 144527503> (DF) [tos 0x10]

<This time I suspend tcpdump so that I shouldn't miss any packets. I
press the arrow key twice again.>

Suspended
[1] + Suspended tcpdump -i eth0 -n port 1023 and port 22
[sroot@oof:/root]# netstat -a -n | grep :22 | grep :1023
tcp 0 40 192.168.143.2:1023 204.174.223.2:22 ESTABLISHED
[sroot@oof:/root]# %
tcpdump -i eth0 -n port 1023 and port 22

<Nothing. Perhaps the timer output will show something useful...>

[sroot@oof:/root]# netstat -a -n -o | grep :22 | grep :1023
tcp 0 40 192.168.143.2:1023 204.174.223.2:22 ESTABLISHED on2 (360.72/0/0)
[sroot@oof:/root]# %
tcpdump -i eth0 -n port 1023 and port 22

<Still not much. Finally, it gets sent and acked almost immediately, as
bove.>

This goes on for about 6 or 7 more times (over a period of about 5
minutes), and just recently it went back to normal -- transmitted stuff
actually gets sent in the same second as when I type it. The connection
is back to normal now, but this after at least 6 or 7 new ACKs have
arrived, and so it should have been able to collapse the round trip time
estimate down enough to transmit out new data relatively normally again
(or something, I'm not very familiar with the internals of TCP). Some
timing value is somehow growing huge and is taking a long time to get
back to a reasonable value. Surely a connection open for several hours
shouldn't suddenly get an average round trip time of over 10 seconds,
though, should it?

Simon-

| Simon Kirby | Systems Administration |
| mailto:sim@netnation.com | NetNation Communications |
| http://www.netnation.com/ | Tech: (604) 684-6892 |

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/