Re: [PATCH v4 net-next 2/3] net/udp: Add 4-tuple hash list basis

From: Philo Lu
Date: Thu Oct 17 2024 - 03:47:21 EST

On 2024/10/16 15:45, Paolo Abeni wrote:
> On 10/16/24 08:30, Philo Lu wrote:
>> On 2024/10/14 18:07, Paolo Abeni wrote:
>>> It would be great if you could please share some benchmark showing the
>>> raw max receive PPS performances for unconnected sockets, with and
>>> without this series applied, to ensure this does not cause any real
>>> regression for such workloads.
>>
>> Tested using sockperf tp with default msgsize (14B), 3 times for w/ and
>> w/o the patch set, and results show no obvious difference:
>>
>> [msg/sec]  test1    test2    test3    mean
>> w/o patch  514,664  519,040  527,115  520.3k
>> w/  patch  516,863  526,337  527,195  523.5k (+0.6%)
>>
>> Thank you for the review, Paolo.

> Are the values in packets per second, or bytes per second? Are you
> doing a loopback test or over the wire? The most important question
> is: is the receiver side keeping (at least) 1 CPU fully busy?
> Otherwise the test is not very relevant.
>
> It looks like you have some setup issue, or you are using relatively
> low-end H/W: the expected packet rate for reasonable server H/W is
> well above 1M (possibly much more than that, but I can't put my hands
> on recent H/W, so I can't provide a more accurate figure).
>
> A single-socket, user-space UDP sender is usually unable to reach
> such tput without USO, and even with USO you likely need to do an
> over-the-wire test to really be able to keep the receiver fully busy.
> AFAICS sockperf does not support USO for the sender.
>
> You could use the udpgso_bench_tx/udpgso_bench_rx pair from the net
> selftests directory instead.
>
> Or you could use pktgen as traffic generator.
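For reference, pktgen is driven through procfs; a minimal single-queue UDP flood looks roughly like the config fragment below (the interface name, destination IP/MAC, and port are placeholders, not values from this thread; see Documentation/networking/pktgen.rst for the full command set):

```shell
# Load pktgen and attach a device to kernel thread 0
# (placeholders: eth0, 10.0.0.2, dst MAC).
modprobe pktgen
echo "rem_device_all"   > /proc/net/pktgen/kpktgend_0
echo "add_device eth0"  > /proc/net/pktgen/kpktgend_0

# Configure the flow: small UDP packets, unlimited count, no inter-packet delay.
echo "count 0"          > /proc/net/pktgen/eth0
echo "pkt_size 60"      > /proc/net/pktgen/eth0
echo "delay 0"          > /proc/net/pktgen/eth0
echo "dst 10.0.0.2"     > /proc/net/pktgen/eth0
echo "dst_mac 00:11:22:33:44:55" > /proc/net/pktgen/eth0
echo "udp_dst_min 9"    > /proc/net/pktgen/eth0
echo "udp_dst_max 9"    > /proc/net/pktgen/eth0

# Start transmitting (write "stop" to pgctrl to end).
echo "start" > /proc/net/pktgen/pgctrl
```

Note pktgen injects packets below the socket layer, so the sender side cannot become the bottleneck the way a user-space UDP sender can.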


I tested it again with udpgso_bench_tx/udpgso_bench_rx. On the server,
2 CPUs are involved: one for udpgso_bench_rx and the other for the NIC
rx queue, so that the si (softirq) of the NIC rx CPU is at 100%.
udpgso_bench_tx runs with payload size 20, and the tx pps is larger
than the rx pps, ensuring rx is the bottleneck.

The outputs of udpgso_bench_rx:
[without patchset]
udp rx: 20 MB/s 1092546 calls/s
udp rx: 20 MB/s 1095051 calls/s
udp rx: 20 MB/s 1094136 calls/s
udp rx: 20 MB/s 1098860 calls/s
udp rx: 20 MB/s 1097963 calls/s
udp rx: 20 MB/s 1097460 calls/s
udp rx: 20 MB/s 1098370 calls/s
udp rx: 20 MB/s 1098089 calls/s
udp rx: 20 MB/s 1095330 calls/s
udp rx: 20 MB/s 1095486 calls/s

[with patchset]
udp rx: 21 MB/s 1105533 calls/s
udp rx: 21 MB/s 1105475 calls/s
udp rx: 21 MB/s 1104244 calls/s
udp rx: 21 MB/s 1105600 calls/s
udp rx: 21 MB/s 1108019 calls/s
udp rx: 21 MB/s 1101971 calls/s
udp rx: 21 MB/s 1104147 calls/s
udp rx: 21 MB/s 1104874 calls/s
udp rx: 21 MB/s 1101987 calls/s
udp rx: 21 MB/s 1105500 calls/s

The averages w/ and w/o the patchset are 1104735 and 1096329
respectively. The gap is 0.8%, which I think is negligible.
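The averages and the relative gap can be double-checked with a short script (the samples are taken verbatim from the udpgso_bench_rx output above):

```python
# Verify the reported udpgso_bench_rx averages and the relative gap.
without_patch = [1092546, 1095051, 1094136, 1098860, 1097963,
                 1097460, 1098370, 1098089, 1095330, 1095486]
with_patch = [1105533, 1105475, 1104244, 1105600, 1108019,
              1101971, 1104147, 1104874, 1101987, 1105500]

mean_without = sum(without_patch) / len(without_patch)
mean_with = sum(with_patch) / len(with_patch)
gap = (mean_with - mean_without) / mean_without

print(f"mean w/o patch: {mean_without:.0f}")  # 1096329
print(f"mean w/  patch: {mean_with:.0f}")     # 1104735
print(f"gap: {gap:.1%}")                      # 0.8%
```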

Besides, perf shows ~0.6 percentage points higher CPU consumption in
__udp4_lib_lookup() with this patchset (increasing from 5.7% to 6.3%).

Thanks.
--
Philo