mss negotiation and path mtu discovery mostly broken?

From: Ristuccia, Brian
Date: Wed Apr 25 2007 - 09:29:24 EST

I'm seeing a problem where the kernel attempts to send packets with a
MSS larger than the one negotiated when the TCP connection is
established. Even after ICMP "can't fragment" messages arrive, the
kernel still attempts to increase the MSS rather aggressively. The end
result is extremely poor throughput when sending to a network with a
smaller MTU.

In /proc/sys/net/ipv4:

The sending host ( has an MTU of 9000. The destination host
( has an MTU of 1500. There is one router between the hosts
which will drop packets with the "DF" flag when they don't fit the
destination interface's MTU and generates the required icmp can't
fragment message.

The dump shows the initial handshake with correct mss options sent:

08:39:55.493029 IP > S
0) win 5840 <mss 1460,sackOK,timestamp 3873837730 0,nop,wscale 2>
08:39:55.493119 IP > S
ack 2768979374 win 17896 <mss 8960,sackOK,timestamp 413751
e 5>

Then I see the system send larger packets (larger than the mss),
provoking a "can't fragment" from the router. Now I suppose it might be
reasonable to occasionally probe a larger MSS when the current MSS is a
result of reductions due to path mtu discovery. After all, the path
taken could change over time. But when the current MSS is at the value
negotiated by the MSS option during the TCP handshake, it seems like
there's no sense in trying to send with a lager MSS. Even if there were,
there's certainly no justification for making such an attempt every
other packet (2.6.18) or every fourth packet (

In the following dump, the system eventually gets in a state where it
oscillates between sendng undeliverable 2896 byte packets and
deliverable 1448 byte ones.

08:39:55.649689 IP > .
5906:10250(4344) ack 1
794 win 674 <nop,nop,timestamp 413790 3873837887>
08:39:55.650532 IP > icmp 92:
unreachable -
need to frag (mtu 1500)
08:39:55.689774 IP > . ack 5906 win
4544 <nop
,nop,timestamp 3873837927 413790>
08:39:55.689784 IP > .
10250:13146(2896) ack
1794 win 674 <nop,nop,timestamp 413800 3873837927>
08:39:55.690497 IP > icmp 92:
unreachable -
need to frag (mtu 1500)
08:39:55.902494 IP > .
5906:7354(1448) ack 17
94 win 674 <nop,nop,timestamp 413853 3873837927>

Since any sane router will only generate can't fragment ICMP's at a
limited rate, for two hosts on gigabit ethernet, one on a MTU 1500
subnet and another on a MTU 9000 subnet, I can move only 40-50KB/sec
over an affected TCP connection.

I was unable to find any reference to this problem in the kernel
changelogs, or even any reports of anyone else having a similar problem.
The above dumps are from I could also reproduce the problem in
2.6.18, although the dumps looked slightly different. I was unable to
reproduce this problem with the 2.6.9-42.0.10.ELsmp kernel which ships
in RHEL4.

I can send a pcap dump to anyone interested.

Brian Ristuccia

"This email message and any attachments are confidential information of Starent Networks, Corp. The information transmitted may not be used to create or change any contractual obligations of Starent Networks, Corp. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon this e-mail and its attachments by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify the sender immediately -- by replying to this message or by sending an email to postmaster@xxxxxxxxxxxxxxxxxxx -- and destroy all copies of this message and any attachments without reading or disclosing their contents. Thank you."
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/