Re: Tx TCP rates down > 20% - A report.

Paul Gortmaker (gpg109@rsphy6.anu.edu.au)
Thu, 2 May 1996 20:11:58 +1000 (EST)


Linus Torvalds <torvalds@cs.Helsinki.FI>
Thu, 2 May 1996 07:06:42 +0300 (EET DST)

> There _does_ seem to be some bad effects with the drivers under some
> circumstances, though. Notably, the "tbusy" handling in the ethernet driver
> interface looks like it's pretty broken - it's used for two things: (a)
> serializing the ethernet driver (which was the original reason for it, but is
> unnecessary these days when the network layer makes sure it's all serialized
> anyway) and (b) as a send throttle to tell the network layer that the
> card is busy.

Yes, but note that the 8390 drivers (via the 8390.c core) haven't used
the dev->tbusy for (a) since I removed it in early 1.3.x -- my testing
with TTCP only involves 8390 cards. I cleaned up the dev->tbusy handling
of all the other drivers as well, but those patches got lost in the noise
of early 1.3.x net changes and never made it in.

> The (b) case is the only thing it does any more, and I suspect it is also
> the thing that makes you see bad performance. The TCP side is much faster
> in the later 1.3.x kernels, and the network cards can no longer keep up
> so the throttle is essentially in effect _all_ the time. What you see is
> probably due to:
>
> - TCP layer has a few packets queued up, sends one to the network driver
> - network driver puts out the packet, sets tbusy
> - TCP layer sees tbusy, and doesn't send any more
> - network driver gets a "tx complete interrupt" and does a callback to
> net layer with mark_bh(NET_BH), and the cycle starts up again..
>
> Essentially, the tbusy thing may result in a _single_ packet being sent
> and then we go away and come back only next time around. Broken, broken,
> broken. I haven't touched it because I don't know the network drivers
> well enough.

Considering the 8390 drivers again, they reserve enough on-card RAM for
two full sized Tx packets. Upon exiting the Tx function, I have it check
to see if both slots are in use, and *only* set tbusy if that is the
case. And a tx done interrupt will clear tbusy (if set) and trigger the
send of the waiting packet in the second slot (if there is one).

So, at least in the case of the 8390 drivers, there should not be single
packet throttling happening.
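
A minimal sketch of the two-slot logic described above, written as a
user-space model rather than taken from 8390.c. The names pp_dev,
pp_start_xmit and pp_tx_done are hypothetical; the only behaviour modelled
is "set tbusy only when both slots are in use, clear it on tx done":

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a two-slot ("ping-pong") Tx path; not 8390.c code. */
struct pp_dev {
    int  queued;   /* packets held in on-card Tx RAM: 0, 1 or 2 */
    bool sending;  /* transmitter is putting a packet on the wire */
    bool tbusy;    /* throttle flag seen by the network layer */
};

/* Upper layer hands over one packet. Returns false if throttled. */
static bool pp_start_xmit(struct pp_dev *d)
{
    if (d->tbusy)
        return false;
    assert(d->queued < 2);
    d->queued++;
    if (!d->sending)
        d->sending = true;        /* idle transmitter: start at once */
    d->tbusy = (d->queued == 2);  /* throttle only when both slots are full */
    return true;
}

/* Tx done interrupt: free a slot, lift the throttle, send any waiting packet. */
static void pp_tx_done(struct pp_dev *d)
{
    assert(d->sending && d->queued > 0);
    d->queued--;
    d->tbusy = false;
    d->sending = (d->queued > 0); /* second slot goes out next, if occupied */
}
```

In this model the first pp_start_xmit() leaves tbusy clear, so the upper
layer can hand over a second packet straight away; only the second call
raises the throttle.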

> In short, the problem is _not_ in the network layer.
>
> The reason 1.2.13 does better is probably two-fold
> - the TCP layer wasn't very fast, so it was entirely possible that the
> driver got the packet sent out quickly enough that there wasn't much
> of a throttling effect.
> - the "net_bh()" handler used to do multiple calls to "dev_transmit()".
> You still see that in net/core/dev.c - look at the things that are
> #ifdef'ed with "XMIT_EVERY" and "XMIT_AFTER".

To test your 2nd theory, I re-enabled both extra calls to dev_transmit()
and tried the same test (machine A to B). Absolutely *no* difference.
Both with and without the extra calls, it sits at 826 -> 829kB/s whereas
1.2.13 can send in excess of 1MB/s to the same recipient. So apparently
everything gets pushed out in the first dev_transmit() call in this case.

> As I said, I'm more-or-less certain that the problem is _not_ the packets
> on the wire, but the drivers. They need to be updated a bit - the 3c509
> driver gets reasonable performance because it has an internal packet queue

Before you get too excited about the 3c509, let's look at what it has
and what it does. The original 3c509 has 4kB, split 2kB Rx and 2kB Tx.
Note that 2kB can't even hold 2 full sized packets. The 3c509B is
better in that it has 8kB, with a power on default split of 5kB Rx and
3kB Tx -- 3kB will just fit 2 full sized packets. Now after the driver
uploads a packet to the 509, it asks the card if it has at least 1536
bytes free for another one. If yes, it clears tbusy. If not, it leaves
tbusy set until the tx_done interrupt comes along. So if you have a
3c509B, you can fit in two full sized packets, just like an 8390. If you
have the original 3c509, you can only sit on one full sized packet.
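
The free-space check can be sketched as a small model. This is not the
3c509 driver source: fifo_dev, fifo_upload and fifo_tx_done are
hypothetical names, a full sized packet is assumed to be 1514 bytes, and
any per-packet overhead in the card's FIFO is ignored. Only the 1536 byte
threshold and the 2kB/3kB Tx sizes come from the text above:

```c
#include <assert.h>
#include <stdbool.h>

#define TX_FREE_THRESHOLD 1536  /* bytes the driver asks the card for */

/* Hypothetical model of a card with a byte-counted Tx FIFO. */
struct fifo_dev {
    int  tx_size;  /* total Tx FIFO bytes: 2048 (3c509) or 3072 (3c509B) */
    int  tx_used;  /* bytes currently waiting in the FIFO */
    bool tbusy;    /* throttle flag seen by the network layer */
};

/* Upload one packet, then throttle unless another full packet would fit. */
static void fifo_upload(struct fifo_dev *d, int len)
{
    assert(len <= d->tx_size - d->tx_used);
    d->tx_used += len;
    d->tbusy = (d->tx_size - d->tx_used) < TX_FREE_THRESHOLD;
}

/* Tx done interrupt: the packet has left the FIFO, so lift the throttle. */
static void fifo_tx_done(struct fifo_dev *d, int len)
{
    assert(len <= d->tx_used);
    d->tx_used -= len;
    d->tbusy = false;
}
```

Under those assumptions a 2048 byte FIFO has 534 bytes free after one
1514 byte packet, so tbusy stays set, while a 3072 byte FIFO has 1558
bytes free and accepts a second packet before throttling.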

> on the card, so the tbusy thing works correctly (instead of throttling
> every packet it throttles maybe every ten packets, which is roughly what
> we'd want).

Once again, the 8390 driver shouldn't be throttling on every packet
either. It can hold two packets via "ping-pong" Tx buffers. If you start
stuffing more than 2 or 3 skb's in the driver, then you start adding more
hysteresis into any sensible flow-control algorithms, as the upper layers
lose control of those packets once the driver does dev_kfree_skb(). Any
priority sorting done by the upper layers will suffer similarly. Also,
if the card dies on you, forcing a reset or whatever, you can end up
dropping a whole handful of packets on the floor if they are stored up
in the card's RAM. Not to mention that for most cards, you start cutting
into valuable Rx RAM space by upping the number of Tx buffers.

Anyway, I have just finished some tcpdumps of the fast-1.2.13-Tx vs the
slower-1.3.97-Tx from a third box. Going just by a line count of the
logs, there are more packets on the wire in the slow case.

foo:/tmp# zcat fast.gz | wc
46657 453477 3073180
foo:/tmp# zcat slow.gz | wc
60169 586163 3989989
foo:/tmp#
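
To put a number on the difference, here is a small helper (percent_more
is a hypothetical name) that compares the two line counts from the first
column of the wc output above. It assumes one log line per packet, which
is how the counts are being read here:

```c
#include <assert.h>

/* Percentage by which `slow` exceeds `fast`; both are log line counts. */
static double percent_more(long fast, long slow)
{
    assert(fast > 0);
    return 100.0 * (double)(slow - fast) / (double)fast;
}
```

percent_more(46657, 60169) comes out at roughly 29, i.e. about 29% more
lines in the slow log than in the fast one.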

I'll package them up and bounce them across to Alan so he can check
them out, as he requested earlier.

Paul.