Re: e1000 full-duplex TCP performance well below wire speed
From: Bruce Allen
Date: Wed Jan 30 2008 - 17:25:39 EST
Hi Stephen,
Thanks for your helpful reply and especially for the literature pointers.
Indeed, we are not asking to see 1000 Mb/s. We'd be happy to see 900
Mb/s.
Netperf is transmitting a large buffer in MTU-sized packets (min 1500
bytes). Since the acks are only about 60 bytes in size, they should be
around 4% of the total traffic. Hence we would not expect to see more
than 960 Mb/s.
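As a rough check of that estimate, here is a minimal Python sketch. It assumes one 60-byte ack for every 1500-byte data packet, which is the simplification used in the paragraph above; with delayed acks the share would be smaller. The names are illustrative only.

```python
LINK_MBPS = 1000   # nominal Gb/s Ethernet line rate
DATA_BYTES = 1500  # MTU-sized data packet
ACK_BYTES = 60     # approximate ack size quoted above


def ack_share(data_bytes=DATA_BYTES, ack_bytes=ACK_BYTES):
    """Fraction of traffic taken by acks, assuming one ack per data packet."""
    return ack_bytes / data_bytes


def data_ceiling_mbps(link_mbps=LINK_MBPS):
    """Line rate left for data once the ack share is subtracted."""
    return link_mbps * (1 - ack_share())


print(f"ack share:    {ack_share():.1%}")
print(f"data ceiling: {data_ceiling_mbps():.0f} Mb/s")
```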
Don't forget the network overhead: http://sd.wareonearth.com/~phil/net/overhead/
Max TCP Payload data rates over ethernet:
(1500-40)/(38+1500) = 94.9285 % IPv4, minimal headers
(1500-52)/(38+1500) = 94.1482 % IPv4, TCP timestamps
Yes. If you look further down the page, you will see that with jumbo
frames (which we have also tried) on Gb/s ethernet the maximum throughput
is:
(9000-20-20-12)/(9000+14+4+7+1+12)*1000000000/1000000 = 990.042 Mbps
We are very far from this number -- averaging perhaps 600 or 700 Mbps.
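For reference, the three figures above can be reproduced with a short Python sketch. The 38 bytes of per-frame overhead are the terms itemized in the jumbo-frame formula (Ethernet header, FCS, preamble, start-of-frame delimiter, inter-frame gap); the function name is mine and only illustrative.

```python
# Per-frame bytes on the wire beyond the MTU:
# Ethernet header 14 + FCS 4 + preamble 7 + start-of-frame delimiter 1
# + inter-frame gap 12
ETH_OVERHEAD = 14 + 4 + 7 + 1 + 12  # = 38


def tcp_payload_mbps(mtu, ip_tcp_headers, link_mbps=1000):
    """Maximum TCP payload rate for a given MTU and IP+TCP header size."""
    return (mtu - ip_tcp_headers) / (mtu + ETH_OVERHEAD) * link_mbps


cases = [
    ("MTU 1500, minimal headers", 1500, 20 + 20),
    ("MTU 1500, TCP timestamps ", 1500, 20 + 20 + 12),
    ("MTU 9000, TCP timestamps ", 9000, 20 + 20 + 12),
]
for label, mtu, headers in cases:
    print(f"{label}: {tcp_payload_mbps(mtu, headers):.3f} Mb/s")
```

Even the least favorable of these ceilings (about 941 Mb/s) is well above the 600 to 700 Mb/s we are measuring.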
I believe what you are seeing is an effect that occurs when using
cubic on links with no other traffic. With two flows at high speed,
the first flow consumes most of the router buffer and backs off gradually,
and the second flow is not very aggressive. It has been discussed
back and forth between TCP researchers with no agreement: one side
says that it is unfairness, and the other side says it is not a problem in
the real world because of the presence of background traffic.
At least in principle, we should have NO congestion here. We have ports
on two different machines wired with a crossover cable. Box A cannot
transmit faster than 1 Gb/s. Box B should be able to receive that data
without dropping packets. It's not doing anything else!
See:
http://www.hamilton.ie/net/pfldnet2007_cubic_final.pdf
http://www.csc.ncsu.edu/faculty/rhee/Rebuttal-LSM-new.pdf
This is extremely helpful. The typical oscillation (startup) period shown
in the plots in these papers is of order 10 seconds, which is similar to
the types of oscillation periods that we are seeing.
*However* we have also seen similar behavior with the Reno congestion
control algorithm. So this might not be due to cubic, or entirely due to
cubic.
In our application (cluster computing) we use a very tightly coupled
high-speed low-latency network. There is no 'wide area traffic'. So it's
hard for me to understand why any networking components or software layers
should take more than milliseconds to ramp up or back off in speed.
Perhaps we should be asking for a TCP congestion avoidance algorithm which
is designed for a data center environment where there are very few hops
and typical packet delivery times are tens or hundreds of microseconds.
That is very different from delivering data thousands of km across a WAN.
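To put a rough number on "milliseconds", here is a back-of-the-envelope Python sketch of an idealized slow start on such a link. Everything in it is an assumption for illustration: a 100 microsecond round-trip time, a 1448-byte segment (MTU 1500 with TCP timestamps), an initial window of 3 segments, and a window that doubles every round trip with no loss. It is not a model of what the kernel actually does.

```python
import math

RATE_BPS = 1_000_000_000  # 1 Gb/s link
RTT_US = 100              # assumed round-trip time, in microseconds
MSS_BYTES = 1448          # assumed segment size (1500 - 52)
INITIAL_CWND = 3          # assumed initial congestion window, in segments

# Bandwidth-delay product: bytes that must be in flight to fill the link.
bdp_bytes = RATE_BPS * RTT_US // (8 * 1_000_000)
bdp_segments = math.ceil(bdp_bytes / MSS_BYTES)


def rtts_to_fill(initial_cwnd=INITIAL_CWND, target=bdp_segments):
    """Round trips of idealized doubling until the window covers the BDP."""
    cwnd, rtts = initial_cwnd, 0
    while cwnd < target:
        cwnd *= 2
        rtts += 1
    return rtts


print(f"BDP: {bdp_bytes} bytes ({bdp_segments} segments)")
print(f"ramp-up: {rtts_to_fill()} RTTs = {rtts_to_fill() * RTT_US} us")
```

Under these assumptions the window needed to fill the link is only a handful of segments and is reached within a couple of round trips, far short of the roughly 10 second oscillations we observe.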
Cheers,
Bruce