Re: [PATCH net-next v7 2/3] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment

From: Richard Gobert
Date: Thu Apr 18 2024 - 11:06:23 EST




Paolo Abeni wrote:
> On Tue, 2024-04-16 at 11:21 +0200, Paolo Abeni wrote:
>> On Fri, 2024-04-12 at 17:55 +0200, Richard Gobert wrote:
>>> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
>>> iph->id, ...) against all packets in a loop. These flush checks are
>>> currently used in all TCP flows and in some UDP flows in GRO.
>>>
>>> These checks need to be done only once and only against the found p skb,
>>> since they only affect flush and not same_flow.
>>>
>>> Leveraging the previous commit in the series, in which correct network
>>> header offsets are saved for both outer and inner network headers, these
>>> checks can now be done only once, in tcp_gro_receive and
>>> udp_gro_receive_segment. As a result, NAPI_GRO_CB(p)->flush is not used at
>>> all. In addition, flush_id checks are more declarative and contained in
>>> inet_gro_flush, thus removing the need for flush_id in napi_gro_cb.
>>>
>>> This results in less parsing code for UDP flows and in flush checks that
>>> no longer run in a loop for TCP flows.
>>>
>>> To make sure results are not within noise range - I've made netfilter drop
>>> all TCP packets, and measured CPU performance in GRO (in this case GRO is
>>> responsible for about 50% of the CPU utilization).
>>>
>>> L3 flush/flush_id checks are not relevant to UDP connections where
>>> skb_gro_receive_list is called. The only code change relevant to this flow
>>> is inet_gro_receive. The rest of the code parsing this flow stays the
>>> same.
>>>
>>> All concurrent connections tested are with the same ip srcaddr and
>>> dstaddr.
>>>
>>> perf top while replaying 64 concurrent IP/UDP connections (UDP fwd flow):
>>> net-next:
>>> 3.03% [kernel] [k] inet_gro_receive
>>>
>>> patch applied:
>>> 2.78% [kernel] [k] inet_gro_receive
>>
>> Why are there no figures for
>> udp_gro_receive_segment()/gro_network_flush() here?
>>
>> Also, you should be able to observe a very high amount of CPU usage by
>> GRO even with TCP on very high speed links, keeping the BH/GRO on one
>> CPU and the user-space/data copy on a different one (or using rx zero
>> copy).
>
> To be more explicit: I think at least the above figures are required, 
> and I still fear the real gain in that case would range from zero to
> negative.
>

I wrote about this briefly in the commit message; sorry if I wasn't clear
enough.

gro_network_flush is inlined into both udp_gro_receive_segment and
tcp_gro_receive, and udp_gro_receive_segment is itself inlined into
udp_gro_receive.
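
For clarity, here is a rough sketch of the idea (my simplification, not the
patch code itself - the helper name and body are illustrative assumptions):
the L3 fields that only affect flush are compared once, against the single
matched p skb, instead of once per packet inside inet_gro_receive's bucket
loop. The actual helpers in the series (gro_network_flush/inet_gro_flush)
also handle IPv6, the iph->id/flush_id logic and the outer+inner headers
saved by patch 1/3, which this sketch omits:

#include <linux/ip.h>
#include <linux/skbuff.h>

/* Illustrative sketch only: ttl, tos and the DF bit affect flush but not
 * same_flow, so checking them once against the one matched skb is enough.
 */
static inline int ipv4_flush_sketch(const struct iphdr *iph,
				    const struct iphdr *iph2)
{
	return (iph->ttl ^ iph2->ttl) |
	       (iph->tos ^ iph2->tos) |
	       ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
}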

The UDP numbers I posted are no longer relevant after the thread with
Willem and Alexander, from which we understood that flush and flush_id
should be calculated for all UDP flows.

I can post new numbers for the UDP fwd path after implementing the correct
change. As for TCP - the numbers I posted stay the same.

Note that there is an increase in CPU utilization in tcp_gro_receive
because gro_network_flush is inlined into it. The numbers still make sense
and show an overall performance improvement in the case I measured, once
both inet_gro_receive and tcp_gro_receive are accounted for.
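
To make the shift in the perf profile concrete, here is a sketch of the
call site (again simplified and assumed rather than the patch's real
plumbing, reusing ipv4_flush_sketch from above; nhoff stands in for the
network header offset saved by patch 1/3):

#include <linux/ip.h>
#include <linux/skbuff.h>

/* Sketch: called once for the matched p skb from the point where
 * tcp_gro_receive/udp_gro_receive_segment already hold both skbs. Since
 * it is inlined there, its cycles show up under those symbols in perf
 * rather than under inet_gro_receive.
 */
static inline int l3_flush_once_sketch(const struct sk_buff *skb,
				       const struct sk_buff *p, int nhoff)
{
	const struct iphdr *iph = (const struct iphdr *)(skb->data + nhoff);
	const struct iphdr *iph2 = (const struct iphdr *)(p->data + nhoff);

	return ipv4_flush_sketch(iph, iph2);
}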

> If you can't do the TCP part of the testing, please provide at least
> the figures for a single UDP flow; that should give more indication WRT
> the result we can expect with TCP.
>
> Note that GRO is used mainly by TCP, and TCP packets with different
> src/dst ports will land in different GRO hash buckets, since they have
> different RX hashes.
>
> That will happen even for UDP, since at least some (most?) NICs include
> the UDP ports in the RX hash.
>
> Thanks,
>
> Paolo
>