Re: [PATCH net-next] net: dsa: add GRO support via gro_cells

From: Florian Fainelli
Date: Mon Apr 06 2020 - 13:57:24 EST

On 4/6/2020 10:34 AM, Alexander Lobakin wrote:
> 06.04.2020, 18:21, "Alexander Lobakin" <bloodyreaper@xxxxxxxxx>:
>> 06.04.2020, 17:48, "Andrew Lunn" <andrew@xxxxxxx>:
>>> On Mon, Apr 06, 2020 at 01:59:10PM +0300, Alexander Lobakin wrote:
>>>> The gro_cells lib is used by different encapsulating netdevices, such
>>>> as geneve, macsec, vxlan etc., to speed up decapsulated traffic
>>>> processing.
>>>> A CPU tag is a sort of "encapsulation", so we can use the same
>>>> mechanism to greatly improve overall DSA performance.
>>>> skbs are passed to the GRO layer after the CPU tag is removed, so we
>>>> don't need any new packet offload type, unlike what I first proposed
>>>> in the original GRO-over-DSA variant [1].
>>>>
>>>> The size of struct gro_cells is sizeof(void *), so the hot struct
>>>> dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
>>>> remain in one 32-byte cacheline.
>>>> The other positive side effect is that drivers for network devices
>>>> that can be used as CPU ports of DSA-driven switches can now use
>>>> napi_gro_frags() to pass skbs to the kernel. Packets built that way
>>>> are completely non-linear and are likely to be dropped without GRO.
>>>>
>>>> This was tested on a soon-to-be-mainlined Ethernet driver that uses
>>>> napi_gro_frags(), and the overall performance was on par with the
>>>> variant from [1], sometimes even better thanks to the minimal overhead.
>>>> Tuning net.core.gro_normal_batch may help to push it to the limit on
>>>> particular setups and platforms.
>>>>
>>>> [1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@xxxxxxxx/
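
For readers who have not used gro_cells before: the API is tiny and gets
wired in the same way as in its other users. A rough sketch of the
slave-side setup/teardown under those assumptions (struct and function
names below are illustrative, not necessarily the ones from the patch):

#include <linux/netdevice.h>
#include <net/gro_cells.h>

struct example_slave_priv {
	/* ... existing per-port fields ... */
	struct gro_cells	gcells;	/* just one percpu pointer */
};

/* port setup: allocate the per-CPU GRO cells backing this slave netdev */
static int example_slave_init(struct net_device *slave_dev)
{
	struct example_slave_priv *p = netdev_priv(slave_dev);

	return gro_cells_init(&p->gcells, slave_dev);
}

/* port teardown: tear down the cells and drop anything still queued */
static void example_slave_exit(struct net_device *slave_dev)
{
	struct example_slave_priv *p = netdev_priv(slave_dev);

	gro_cells_destroy(&p->gcells);
}

The receive side is a single call; it is shown further below where the RX
hook point is discussed.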
>>>
>>> Hi Alexander
>>
>> Hi Andrew!
>>
>>> net-next is closed at the moment, so you should have posted this with
>>> an RFC prefix.
>>
>> I saw that it's closed, but didn't know about "RFC" tags for that period,
>> sorry.
>>
>>> The implementation looks nice and simple. But it would be nice to have
>>> some performance figures.
>>
>> Sure, I will. I think I'll collect the stats with the various main
>> receive functions in the Ethernet driver (napi_gro_frags(),
>> napi_gro_receive(), netif_receive_skb(), netif_receive_skb_list()),
>> both with and without this patch, to make them as complete as possible.
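
For anyone following along, these are the four driver-side delivery calls
being compared; a rough sketch of where each one sits at the end of a NAPI
poll (function names here are made up purely for illustration):

#include <linux/list.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* 1) plain per-packet delivery: no batching, no GRO */
static void example_rx_plain(struct sk_buff *skb)
{
	netif_receive_skb(skb);
}

/* 2) listified delivery: queue skbs during the poll and flush the whole
 *    list once at the end -- batches the stack traversal, still no GRO
 */
static void example_rx_listified(struct sk_buff *skb,
				 struct list_head *rx_list)
{
	list_add_tail(&skb->list, rx_list);
	/* ... after the RX loop: netif_receive_skb_list(rx_list); */
}

/* 3) per-packet GRO: consecutive TCP segments may be merged before they
 *    enter the stack
 */
static void example_rx_gro(struct napi_struct *napi, struct sk_buff *skb)
{
	napi_gro_receive(napi, skb);
}

/* 4) frag-based GRO: the skb comes from napi_get_frags(), the driver only
 *    attaches page fragments, so the resulting skb is fully non-linear
 */
static void example_rx_gro_frags(struct napi_struct *napi)
{
	napi_gro_frags(napi);
}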
>
> OK, so here we go.
>
> My device is a 1.2 GHz 4-core MIPS32 R2. The Ethernet controller serving
> as the CPU port is capable of S/G, fraglist S/G, TSO4/6 and GSO UDP L4.
> Tests are performed through a simple IPoE VLAN NAT forwarding setup
> (port0 <-> port1.218) with iperf3 in TCP mode.
> net.core.gro_normal_batch is always set to 16, as that value seems to be
> the most effective for this particular hardware and these drivers.
>
> Packet counters on eth0 are the real numbers of frames going through the
> interface. Counters on portX are pure software and are updated inside the
> networking stack.
>
> ---------------------------------------------------------------------
>
> netif_receive_skb() in Eth driver, no patch:
>
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-120.01 sec  9.00 GBytes   644 Mbits/sec  413            sender
> [  5]   0.00-120.00 sec  8.99 GBytes   644 Mbits/sec                 receiver
>
> eth0
> RX packets:7097731 errors:0 dropped:0 overruns:0 frame:0
> TX packets:7097702 errors:0 dropped:0 overruns:0 carrier:0
>
> port0
> RX packets:426050 errors:0 dropped:0 overruns:0 frame:0
> TX packets:6671829 errors:0 dropped:0 overruns:0 carrier:0
>
> port1
> RX packets:6671681 errors:0 dropped:0 overruns:0 carrier:0
> TX packets:425862 errors:0 dropped:0 overruns:0 carrier:0
>
> port1.218
> RX packets:6671677 errors:0 dropped:0 overruns:0 frame:0
> TX packets:425851 errors:0 dropped:0 overruns:0 carrier:0
>
> ---------------------------------------------------------------------
>
> netif_receive_skb_list() in Eth driver, no patch:
>
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-120.01 sec  9.48 GBytes   679 Mbits/sec  129            sender
> [  5]   0.00-120.00 sec  9.48 GBytes   679 Mbits/sec                 receiver
>
> eth0
> RX packets:7448098 errors:0 dropped:0 overruns:0 frame:0
> TX packets:7448073 errors:0 dropped:0 overruns:0 carrier:0
>
> port0
> RX packets:416115 errors:0 dropped:0 overruns:0 frame:0
> TX packets:7032121 errors:0 dropped:0 overruns:0 carrier:0
>
> port1
> RX packets:7031983 errors:0 dropped:0 overruns:0 frame:0
> TX packets:415941 errors:0 dropped:0 overruns:0 carrier:0
>
> port1.218
> RX packets:7031978 errors:0 dropped:0 overruns:0 frame:0
> TX packets:415930 errors:0 dropped:0 overruns:0 carrier:0
>
> ---------------------------------------------------------------------
>
> napi_gro_receive() in Eth driver, no patch:
>
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-120.01 sec  10.0 GBytes   718 Mbits/sec  107            sender
> [  5]   0.00-120.00 sec  10.0 GBytes   718 Mbits/sec                 receiver
>
> eth0
> RX packets:7868281 errors:0 dropped:0 overruns:0 frame:0
> TX packets:7868267 errors:0 dropped:0 overruns:0 carrier:0
>
> port0
> RX packets:429082 errors:0 dropped:0 overruns:0 frame:0
> TX packets:7439343 errors:0 dropped:0 overruns:0 carrier:0
>
> port1
> RX packets:7439199 errors:0 dropped:0 overruns:0 frame:0
> TX packets:428913 errors:0 dropped:0 overruns:0 carrier:0
>
> port1.218
> RX packets:7439195 errors:0 dropped:0 overruns:0 frame:0
> TX packets:428902 errors:0 dropped:0 overruns:0 carrier:0
>
> =====================================================================
>
> netif_receive_skb() in Eth driver + patch:
>
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-120.01 sec  12.2 GBytes   870 Mbits/sec  2267           sender
> [  5]   0.00-120.00 sec  12.2 GBytes   870 Mbits/sec                 receiver
>
> eth0
> RX packets:9474792 errors:0 dropped:0 overruns:0 frame:0
> TX packets:9474777 errors:0 dropped:0 overruns:0 carrier:0
>
> port0
> RX packets:455200 errors:0 dropped:0 overruns:0 frame:0
> TX packets:353288 errors:0 dropped:0 overruns:0 carrier:0
>
> port1
> RX packets:9019592 errors:0 dropped:0 overruns:0 frame:0
> TX packets:455035 errors:0 dropped:0 overruns:0 carrier:0
>
> port1.218
> RX packets:353144 errors:0 dropped:0 overruns:0 frame:0
> TX packets:455024 errors:0 dropped:0 overruns:0 carrier:0
>
> ---------------------------------------------------------------------
>
> netif_receive_skb_list() in Eth driver + patch:
>
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-120.01 sec  11.6 GBytes   827 Mbits/sec  2224           sender
> [  5]   0.00-120.00 sec  11.5 GBytes   827 Mbits/sec                 receiver
>
> eth0
> RX packets:8981651 errors:0 dropped:0 overruns:0 frame:0
> TX packets:898187 errors:0 dropped:0 overruns:0 carrier:0
>
> port0
> RX packets:436159 errors:0 dropped:0 overruns:0 frame:0
> TX packets:335665 errors:0 dropped:0 overruns:0 carrier:0
>
> port1
> RX packets:8545492 errors:0 dropped:0 overruns:0 frame:0
> TX packets:436071 errors:0 dropped:0 overruns:0 carrier:0
>
> port1.218
> RX packets:335593 errors:0 dropped:0 overruns:0 frame:0
> TX packets:436065 errors:0 dropped:0 overruns:0 carrier:0
>
> ---------------------------------------------------------------------
>
> napi_gro_receive() in Eth driver + patch:
>
> [ ID] Interval           Transfer     Bitrate         Retr
> [  5]   0.00-120.01 sec  11.8 GBytes   855 Mbits/sec  122            sender
> [  5]   0.00-120.00 sec  11.8 GBytes   855 Mbits/sec                 receiver
>
> eth0
> RX packets:9292214 errors:0 dropped:0 overruns:0 frame:0
> TX packets:9292190 errors:0 dropped:0 overruns:0 carrier:0
>
> port0
> RX packets:438516 errors:0 dropped:0 overruns:0 frame:0
> TX packets:347236 errors:0 dropped:0 overruns:0 carrier:0
>
> port1
> RX packets:8853698 errors:0 dropped:0 overruns:0 frame:0
> TX packets:438331 errors:0 dropped:0 overruns:0 carrier:0
>
> port1.218
> RX packets:347082 errors:0 dropped:0 overruns:0 frame:0
> TX packets:438320 errors:0 dropped:0 overruns:0 carrier:0
>
> ---------------------------------------------------------------------
>
> The main goal is achieved: we get about a 100-200 Mbps performance
> boost, while the number of skbs traversing the stack is greatly reduced
> from ~8-9 million to ~350,000 (compare port0 TX and port1 RX without
> the patch and with it).

And the number of TCP retries is also lower in the napi_gro_receive() case,
which likely means that we are making better use of the flow control built
into the hardware/driver here?

BTW, do you know why you have so many retries, though? It sounds like your
flow control is missing a few edge cases, or that your TX admission queue
is configured incorrectly.

>
> The main bottleneck in the gro_cells setup is that the GRO layer only
> starts to work after the skbs have already been processed by the DSA
> stack, so they travel frame-by-frame until that point (see the RX
> counter on port1).
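
For context, the hook point in question sits in the generic ETH_P_XDSA
receive handler; a heavily simplified sketch of it follows (stats, checks
and error paths stripped, names illustrative), just to show where GRO can
finally start merging:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/dsa.h>
#include <net/gro_cells.h>

struct example_slave_priv {
	struct gro_cells gcells;
	/* ... */
};

static int example_dsa_switch_rcv(struct sk_buff *skb, struct net_device *dev,
				  struct packet_type *pt,
				  struct net_device *unused)
{
	struct dsa_port *cpu_dp = dev->dsa_ptr;
	struct example_slave_priv *p;
	struct sk_buff *nskb;

	/* the tagger parses and strips the CPU tag and retargets skb->dev
	 * to the right slave port -- this still happens frame by frame
	 */
	nskb = cpu_dp->rcv(skb, dev, pt);
	if (!nskb) {
		kfree_skb(skb);
		return 0;
	}
	skb = nskb;

	p = netdev_priv(skb->dev);

	/* before the patch this was netif_receive_skb(skb); with gro_cells
	 * the untagged skb goes to the per-CPU GRO cell instead, and only
	 * from here on can segments be merged
	 */
	gro_cells_receive(&p->gcells, skb);

	return 0;
}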
>
> If one day we change the way incoming packets are handled (i.e. not
> through the fake packet_type), we could avoid that by unblocking GRO
> processing between the Ethernet driver and the DSA core.
> With my custom packet_offload for ETH_P_XDSA, which only works for my
> CPU tag format, I get about ~910-920 Mbps on the same platform.
> That approach doesn't fit mainline code of course, so I'm working on
> alternative Rx paths for DSA, e.g. through net_device::rx_handler()
> etc.
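
Purely as an illustration of what such an rx_handler based path could look
like (this is not what the patch does, and the tag-stripping helper below
is hypothetical):

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* hypothetical: would run the tagger, strip the CPU tag and point
 * skb->dev at the corresponding slave port
 */
static bool example_strip_cpu_tag(struct sk_buff *skb);

static rx_handler_result_t example_dsa_rx_handler(struct sk_buff **pskb)
{
	struct sk_buff *skb = *pskb;

	if (!example_strip_cpu_tag(skb)) {
		kfree_skb(skb);
		return RX_HANDLER_CONSUMED;	/* unknown or bad tag */
	}

	/* ask the stack to re-run RX processing on the new skb->dev */
	return RX_HANDLER_ANOTHER;
}

/* registered once on the CPU port (master) netdev, with rtnl held */
static int example_register_rx_handler(struct net_device *master)
{
	return netdev_rx_handler_register(master, example_dsa_rx_handler,
					  NULL);
}

(The handler runs after the master's GRO stage, so this only shows the
registration mechanics, not a GRO win by itself.)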
>
> Until then, gro_cells really improves things a lot, while the actual
> patch is tiny.
>
--
Florian