Re: [PATCH net-next v7 2/3] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment
From: Paolo Abeni
Date: Tue Apr 16 2024 - 05:58:15 EST
On Tue, 2024-04-16 at 11:21 +0200, Paolo Abeni wrote:
> On Fri, 2024-04-12 at 17:55 +0200, Richard Gobert wrote:
> > {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> > iph->id, ...) against all packets in a loop. These flush checks are
> > currently used in all TCP flows and in some UDP flows in GRO.
> >
> > These checks need to be done only once and only against the found p skb,
> > since they only affect flush and not same_flow.
> >
> > The previous commit in the series saves correct network header offsets
> > for both outer and inner network headers, allowing these checks to be
> > done only once, in tcp_gro_receive and udp_gro_receive_segment. As a
> > result, NAPI_GRO_CB(p)->flush is not used at all. In addition, the
> > flush_id checks are more declarative and contained in inet_gro_flush,
> > thus removing the need for flush_id in napi_gro_cb.
> >
> > This results in less parsing code for UDP flows, and in flush tests
> > that no longer run in a loop for TCP flows.
> >
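For context, a rough sketch of the shape this takes (hypothetical
helper name; the comparisons below mirror the per-packet checks done
today in inet_gro_receive, run here only once against the matched 'p'):

#include <linux/ip.h>
#include <linux/skbuff.h>

/* Sketch only: the L3 checks only affect 'flush', never 'same_flow',
 * so running them once against the single matching 'p' found by the
 * L4 layer (tcp_gro_receive/udp_gro_receive_segment) is enough.
 */
static int l3_flush_check_sketch(const struct sk_buff *p,
				 const struct sk_buff *skb)
{
	/* Headers located via the offsets saved by the previous commit
	 * in the series (outer headers shown here).
	 */
	const struct iphdr *iph = (const struct iphdr *)skb_network_header(skb);
	const struct iphdr *iph2 = (const struct iphdr *)skb_network_header(p);

	return (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) |
	       ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
}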
> > To make sure the results are not within the noise range, I've made
> > netfilter drop all TCP packets and measured CPU performance in GRO (in
> > this case GRO is responsible for about 50% of the CPU utilization).
> >
> > L3 flush/flush_id checks are not relevant to UDP connections where
> > skb_gro_receive_list is called. The only code change relevant to this
> > flow is in inet_gro_receive. The rest of the code parsing this flow
> > stays the same.
> >
> > All concurrent connections tested use the same IP srcaddr and
> > dstaddr.
> >
> > perf top while replaying 64 concurrent IP/UDP connections (UDP fwd flow):
> > net-next:
> > 3.03% [kernel] [k] inet_gro_receive
> >
> > patch applied:
> > 2.78% [kernel] [k] inet_gro_receive
>
> Why are there no figures for
> udp_gro_receive_segment()/gro_network_flush() here?
>
> Also, you should be able to observe a very high amount of CPU usage by
> GRO even with TCP on very high speed links, by keeping the BH/GRO on
> one CPU and the user-space/data copy on a different one (or by using
> rx zero copy).
To be more explicit: I think at least the above figures are required,
and I still fear the real gain in that case would range from zero to
negative.
If you can't do the TCP part of the testing, please provide at least
the figures for a single UDP flow; that should give more of an
indication WRT the results we can expect with TCP.
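In case it helps with the TCP side, a minimal user-space sketch of the
CPU split I mean (CPU numbers are arbitrary; steering the NIC queue IRQ
to a different CPU via its smp_affinity is assumed to be done
separately):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the receiving process to CPU 1 so the data copy runs away from
 * the CPU doing BH/GRO work for the NIC queue.
 */
int main(void)
{
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(1, &set);

	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	/* ... run the TCP (or UDP) receive loop here ... */
	return 0;
}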
Note that GRO is used mainly by TCP, and TCP packets with different
src/dst ports will land in different GRO hash buckets, as they have
different RX hashes.
That will happen even for UDP, at least with some (most?) NICs, which
include the UDP ports in the RX hash.
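For reference, the bucket selection in dev_gro_receive() boils down to
masking the RX hash (paraphrased here, not a verbatim copy of
net/core/gro.c):

	/* Flows whose RX hash differs (e.g. because the NIC folds the
	 * L4 ports into the hash) are masked into different buckets,
	 * so they never meet in the same GRO flow list.
	 */
	u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
	struct gro_list *gro_list = &napi->gro_hash[bucket];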
Thanks,
Paolo