Re: [PATCH net-next] net: dsa: add GRO support via gro_cells

From: Florian Fainelli
Date: Mon Apr 06 2020 - 16:16:17 EST




On 4/6/2020 12:11 PM, Alexander Lobakin wrote:
> 06.04.2020, 20:57, "Florian Fainelli" <f.fainelli@xxxxxxxxx>:
>> On 4/6/2020 10:34 AM, Alexander Lobakin wrote:
>>> 06.04.2020, 18:21, "Alexander Lobakin" <bloodyreaper@xxxxxxxxx>:
>>>> 06.04.2020, 17:48, "Andrew Lunn" <andrew@xxxxxxx>:
>>>>> On Mon, Apr 06, 2020 at 01:59:10PM +0300, Alexander Lobakin wrote:
>>>>>> The gro_cells library is used by various encapsulating netdevices,
>>>>>> such as geneve, macsec and vxlan, to speed up processing of
>>>>>> decapsulated traffic. A CPU tag is a sort of "encapsulation", so we
>>>>>> can use the same mechanism to greatly improve overall DSA
>>>>>> performance.
>>>>>> skbs are passed to the GRO layer after the CPU tags are removed, so
>>>>>> we don't need any new packet offload types, unlike in the first
>>>>>> GRO-over-DSA variant I proposed [1].
>>>>>>
>>>>>> The size of struct gro_cells is sizeof(void *), so the hot struct
>>>>>> dsa_slave_priv grows by only 4/8 bytes and all critical fields
>>>>>> remain in one 32-byte cacheline.
>>>>>> The other positive side effect is that drivers for network devices
>>>>>> that can be shipped as CPU ports of DSA-driven switches can now use
>>>>>> napi_gro_frags() to pass skbs to the kernel. Packets built that way
>>>>>> are completely non-linear and would likely be dropped without GRO.
>>>>>>
>>>>>> This was tested on a to-be-mainlined-soon Ethernet driver that uses
>>>>>> napi_gro_frags(), and the overall performance was on par with the
>>>>>> variant from [1], sometimes even better thanks to the minimal
>>>>>> overhead. Tuning net.core.gro_normal_batch may help push it to the
>>>>>> limit on particular setups and platforms.
>>>>>>
>>>>>> [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@xxxxxxxx/
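>>>>>>
>>>>>> Roughly, the wiring looks like this (a simplified sketch of the
>>>>>> gro_cells usage, not the exact diff; the surrounding function names
>>>>>> are illustrative):
>>>>>>
>>>>>> #include <net/gro_cells.h>
>>>>>> #include <linux/netdevice.h>
>>>>>>
>>>>>> struct dsa_slave_priv {
>>>>>> 	/* ... existing fields ... */
>>>>>> 	struct gro_cells gcells;	/* one percpu pointer, sizeof(void *) */
>>>>>> };
>>>>>>
>>>>>> /* slave creation: allocate the per-CPU GRO cells */
>>>>>> static int dsa_slave_gro_setup(struct net_device *slave_dev)
>>>>>> {
>>>>>> 	struct dsa_slave_priv *p = netdev_priv(slave_dev);
>>>>>>
>>>>>> 	return gro_cells_init(&p->gcells, slave_dev);
>>>>>> }
>>>>>>
>>>>>> /* RX hot path: after the tagger has stripped the CPU tag and set
>>>>>>  * skb->dev to the slave netdevice, hand the skb to GRO instead of
>>>>>>  * calling netif_receive_skb() directly.
>>>>>>  */
>>>>>> static void dsa_slave_gro_receive(struct sk_buff *skb)
>>>>>> {
>>>>>> 	struct dsa_slave_priv *p = netdev_priv(skb->dev);
>>>>>>
>>>>>> 	gro_cells_receive(&p->gcells, skb);
>>>>>> }
>>>>>>
>>>>>> /* slave destruction: free the per-CPU cells */
>>>>>> static void dsa_slave_gro_teardown(struct net_device *slave_dev)
>>>>>> {
>>>>>> 	struct dsa_slave_priv *p = netdev_priv(slave_dev);
>>>>>>
>>>>>> 	gro_cells_destroy(&p->gcells);
>>>>>> }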
>>>>>
>>>>> Hi Alexander
>>>>
>>>> Hi Andrew!
>>>>
>>>>> net-next is closed at the moment. So you should have posted this
>>>>> with an RFC prefix.
>>>>
>>>> I saw that it's closed, but didn't know about "RFC" tags for that
>>>> period, sorry.
>>>>
>>>>> The implementation looks nice and simple. But it would be nice to
>>>>> have some performance figures.
>>>>
>>>> I will, sure. I think I'll collect the stats with the various main
>>>> receive functions in the Ethernet driver (napi_gro_frags(),
>>>> napi_gro_receive(), netif_receive_skb(), netif_receive_skb_list()),
>>>> with and without this patch, to make them as complete as possible.
>>>
>>> OK, so here we go.
>>>
>>> My device is a 1.2 GHz 4-core MIPS32 R2. The Ethernet controller
>>> representing the CPU port is capable of S/G, fraglist S/G, TSO4/6
>>> and GSO UDP L4.
>>> Tests are performed through a simple IPoE VLAN NAT forwarding setup
>>> (port0 <-> port1.218) with iperf3 in TCP mode.
>>> net.core.gro_normal_batch is always set to 16, as that value seems
>>> to be the most effective for this particular hardware and drivers.
>>>
>>> Packet counters on eth0 are the real numbers of frames going over
>>> the wire. Counters on portX are pure software and are updated inside
>>> the networking stack.
>>>
>>> ---------------------------------------------------------------------
>>>
>>> netif_receive_skb() in Eth driver, no patch:
>>>
>>> [ ID] Interval Transfer Bitrate Retr
>>> [ 5] 0.00-120.01 sec 9.00 GBytes 644 Mbits/sec 413 sender
>>> [ 5] 0.00-120.00 sec 8.99 GBytes 644 Mbits/sec receiver
>>>
>>> eth0
>>> RX packets:7097731 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:7097702 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port0
>>> RX packets:426050 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:6671829 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1
>>> RX packets:6671681 errors:0 dropped:0 overruns:0 carrier:0
>>> TX packets:425862 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1.218
>>> RX packets:6671677 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:425851 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> ---------------------------------------------------------------------
>>>
>>> netif_receive_skb_list() in Eth driver, no patch:
>>>
>>> [ ID] Interval Transfer Bitrate Retr
>>> [ 5] 0.00-120.01 sec 9.48 GBytes 679 Mbits/sec 129 sender
>>> [ 5] 0.00-120.00 sec 9.48 GBytes 679 Mbits/sec receiver
>>>
>>> eth0
>>> RX packets:7448098 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:7448073 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port0
>>> RX packets:416115 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:7032121 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1
>>> RX packets:7031983 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:415941 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1.218
>>> RX packets:7031978 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:415930 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> ---------------------------------------------------------------------
>>>
>>> napi_gro_receive() in Eth driver, no patch:
>>>
>>> [ ID] Interval Transfer Bitrate Retr
>>> [ 5] 0.00-120.01 sec 10.0 GBytes 718 Mbits/sec 107 sender
>>> [ 5] 0.00-120.00 sec 10.0 GBytes 718 Mbits/sec receiver
>>>
>>> eth0
>>> RX packets:7868281 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:7868267 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port0
>>> RX packets:429082 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:7439343 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1
>>> RX packets:7439199 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:428913 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1.218
>>> RX packets:7439195 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:428902 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> =====================================================================
>>>
>>> netif_receive_skb() in Eth driver + patch:
>>>
>>> [ ID] Interval Transfer Bitrate Retr
>>> [ 5] 0.00-120.01 sec 12.2 GBytes 870 Mbits/sec 2267 sender
>>> [ 5] 0.00-120.00 sec 12.2 GBytes 870 Mbits/sec receiver
>>>
>>> eth0
>>> RX packets:9474792 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:9474777 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port0
>>> RX packets:455200 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:353288 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1
>>> RX packets:9019592 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:455035 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1.218
>>> RX packets:353144 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:455024 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> ---------------------------------------------------------------------
>>>
>>> netif_receive_skb_list() in Eth driver + patch:
>>>
>>> [ ID] Interval Transfer Bitrate Retr
>>> [ 5] 0.00-120.01 sec 11.6 GBytes 827 Mbits/sec 2224 sender
>>> [ 5] 0.00-120.00 sec 11.5 GBytes 827 Mbits/sec receiver
>>>
>>> eth0
>>> RX packets:8981651 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:898187 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port0
>>> RX packets:436159 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:335665 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1
>>> RX packets:8545492 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:436071 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1.218
>>> RX packets:335593 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:436065 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> -----------------------------------------------------------
>>>
>>> napi_gro_receive() in Eth driver + patch:
>>>
>>> [ ID] Interval Transfer Bitrate Retr
>>> [ 5] 0.00-120.01 sec 11.8 GBytes 855 Mbits/sec 122 sender
>>> [ 5] 0.00-120.00 sec 11.8 GBytes 855 Mbits/sec receiver
>>>
>>> eth0
>>> RX packets:9292214 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:9292190 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port0
>>> RX packets:438516 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:347236 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1
>>> RX packets:8853698 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:438331 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> port1.218
>>> RX packets:347082 errors:0 dropped:0 overruns:0 frame:0
>>> TX packets:438320 errors:0 dropped:0 overruns:0 carrier:0
>>>
>>> -----------------------------------------------------------
>>>
>>> The main goal is achieved: we get about 100-200 Mbps of performance
>>> boost, while the number of in-stack skbs is greatly reduced from
>>> ~8-9 million to ~350000 (compare port0 TX and port1 RX with and
>>> without the patch).
>>
>> And the number of TCP retries is also lower, which likely means that we
>> are making better use of the flow control built into the hardware/driver
>> here?
>>
>> BTW do you know why you have so many retries though? It sounds like your
>> flow control is missing a few edge cases, or that you have an incorrect
>> configuration of your TX admission queue.
>
> Well, I have the same question TBH. For all the ~1.5 years that I have
> been working on these switches, I have seen a fairly chaotic number of
> TCP retransmissions each time I change something in the code. They are
> less likely to happen when the average CPU load is lower, but ~100 is
> the best result I ever got.
> It seems like I should stop trying to push software throughput to the
> max for a while, pay more attention to this and to the hardware
> configuration instead, and check whether I am missing something :)

I have had to debug such a problem on some of our systems recently, and
it came down to a couple of things for those systems:

- as a receiver, we could cause fast retransmissions on the sender side
because of packet loss: the switch is able to push packets faster than
the DSA master can write them to DRAM. One way to work around this is
to clock the Ethernet MAC higher, at the cost of power consumption.

- as a sender, we could see fast retransmissions when we were ourselves
a "fast" CPU (1.7 GHz or higher for Gigabit throughput). That part is
still being root-caused, but I think it comes down to flow control
being incorrectly set up in hardware, which means you could lose
packets between your ndo_start_xmit() and the software TXQ asserting
XON/XOFF properly; a sketch of the usual stop/wake pattern is below.
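
A minimal sketch of that XON/XOFF (stop/wake) pattern, with hypothetical
foo_* helpers standing in for the driver-specific TX ring management:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

struct foo_priv {
	struct net_device *dev;
	/* TX ring state elided */
};

/* hypothetical ring helpers provided by the driver */
unsigned int foo_tx_ring_free(struct foo_priv *priv);
void foo_tx_map_and_kick(struct foo_priv *priv, struct sk_buff *skb);
void foo_tx_reclaim(struct foo_priv *priv);

#define FOO_TX_WAKE_THRESH	(MAX_SKB_FRAGS + 2)

static netdev_tx_t foo_start_xmit(struct sk_buff *skb,
				  struct net_device *dev)
{
	struct foo_priv *priv = netdev_priv(dev);

	/* XOFF: stop the software TXQ *before* the ring can overflow,
	 * otherwise packets are silently dropped between
	 * ndo_start_xmit() and the hardware.
	 */
	if (unlikely(foo_tx_ring_free(priv) < MAX_SKB_FRAGS + 1)) {
		netif_stop_queue(dev);
		return NETDEV_TX_BUSY;
	}

	foo_tx_map_and_kick(priv, skb);

	/* stop proactively if the next frame might not fit */
	if (foo_tx_ring_free(priv) < MAX_SKB_FRAGS + 1)
		netif_stop_queue(dev);

	return NETDEV_TX_OK;
}

/* TX completion (IRQ/NAPI) path */
static void foo_tx_complete(struct foo_priv *priv)
{
	foo_tx_reclaim(priv);

	/* XON: wake the queue once there is room again */
	if (netif_queue_stopped(priv->dev) &&
	    foo_tx_ring_free(priv) > FOO_TX_WAKE_THRESH)
		netif_wake_queue(priv->dev);
}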

So in both cases packet loss is responsible for those fast
retransmissions, but it is barely observable (case #1 was observable,
since the switch port counters did not match the Ethernet MAC MIB
counters) because you get a black-hole effect.
--
Florian