Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

From: Jesper Dangaard Brouer
Date: Wed May 22 2024 - 03:10:07 EST

On 17/05/2024 18.15, Sebastian Andrzej Siewior wrote:
On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
Trick for CPU-map to do early drop on remote CPU:

# ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1

I recommend pressing Ctrl+\ while it is running to show more info, like
which CPUs are being used and how much the kthread consumes. This helps
catch issues, e.g. if you are redirecting to the same CPU that RX
happens to run on.

Okay. So I reworked the last two patches to make the struct part of
task_struct and then did as you suggested:

Unpatched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat: 952102520 ( 952,102,520) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14876602 ( 14,876,602) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat: 14876602 ( 14,876,602) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat: 446045897 ( 446,045,897) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat: 7434098 ( 7,434,098) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat: 446556042 ( 446,556,042) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat: 7442601 ( 7,442,601) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat: 892592523 ( 892,592,523) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14876542 ( 14,876,542) <= tx_packets /sec
|Ethtool(eno2np1 ) stat: 2 ( 2) <= tx_restart /sec
|Ethtool(eno2np1 ) stat: 2 ( 2) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat: 14876622 ( 14,876,622) <= tx_unicast /sec
|
|Receive:
|eth1->? 8,732,508 rx/s 0 err,drop/s
| receive total 8,732,508 pkt/s 0 drop/s 0 error/s
| cpu:10 8,732,508 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 8,732,510 pkt/s 0 drop/s 7.00 bulk-avg
| cpu:10->3 8,732,510 pkt/s 0 drop/s 7.00 bulk-avg
| kthread total 8,732,506 pkt/s 0 drop/s 205,650 sched
| cpu:3 8,732,506 pkt/s 0 drop/s 205,650 sched
| xdp_stats 0 pass/s 8,732,506 drop/s 0 redir/s
| cpu:3 0 pass/s 8,732,506 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

I verified that the "drop only" case hits 14M packets/s while this
redirect part reports 8M packets/s.
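
For readers following along: "make the struct part of task_struct"
refers to keeping bpf_redirect_info reachable via the current task
instead of per-CPU data, which PREEMPT_RT cannot rely on because
softirq processing there is preemptible and may migrate. A rough sketch
of the pattern; the member and helper names below are my guesses for
illustration, not necessarily the exact code in this series:

	/* Sketch only: names are illustrative, not the actual patch.
	 * task_struct is assumed to gain:
	 *	struct bpf_net_context *bpf_net_context;
	 */
	struct bpf_net_context {
		struct bpf_redirect_info ri;
	};

	/* Caller provides on-stack storage around the NAPI/XDP section. */
	static inline struct bpf_net_context *
	bpf_net_ctx_set(struct bpf_net_context *bpf_net_ctx)
	{
		current->bpf_net_context = bpf_net_ctx;
		return bpf_net_ctx;
	}

	static inline void bpf_net_ctx_clear(void)
	{
		current->bpf_net_context = NULL;
	}

	/* Redirect helpers then look the state up via current instead
	 * of a per-CPU variable. */
	static inline struct bpf_redirect_info *bpf_net_ctx_get_ri(void)
	{
		return &current->bpf_net_context->ri;
	}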


Great, this is a good test.

The transmit speed of 14.88 Mpps is 10G wirespeed at the smallest
Ethernet packet size (84 bytes on the wire incl. overhead + interframe
gap: 10*10^9/(84*8) = 14,880,952 pps).
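
The same arithmetic as a tiny helper, for anyone who wants to plug in
other link speeds or frame sizes (a sketch; the 20 extra bytes are the
7B preamble + 1B SFD + 12B interframe gap around each frame):

	/* Theoretical max packets/sec for a link speed and frame size.
	 * A minimum 64B Ethernet frame occupies 84 bytes on the wire. */
	static double wirespeed_pps(double link_bps, unsigned int frame_bytes)
	{
		return link_bps / ((frame_bytes + 20) * 8);
	}

	/* wirespeed_pps(10e9, 64) -> 14880952.38, i.e. 14.88 Mpps */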


Patched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat: 952635404 ( 952,635,404) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14884934 ( 14,884,934) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat: 14884928 ( 14,884,928) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat: 446496117 ( 446,496,117) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat: 7441602 ( 7,441,602) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat: 446603461 ( 446,603,461) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat: 7443391 ( 7,443,391) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat: 893086506 ( 893,086,506) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14884775 ( 14,884,775) <= tx_packets /sec
|Ethtool(eno2np1 ) stat: 14 ( 14) <= tx_restart /sec
|Ethtool(eno2np1 ) stat: 14 ( 14) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat: 14884937 ( 14,884,937) <= tx_unicast /sec
|
|Receive:
|eth1->? 8,735,198 rx/s 0 err,drop/s
| receive total 8,735,198 pkt/s 0 drop/s 0 error/s
| cpu:6 8,735,198 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 8,735,193 pkt/s 0 drop/s 7.00 bulk-avg
| cpu:6->3 8,735,193 pkt/s 0 drop/s 7.00 bulk-avg
| kthread total 8,735,191 pkt/s 0 drop/s 208,054 sched
| cpu:3 8,735,191 pkt/s 0 drop/s 208,054 sched
| xdp_stats 0 pass/s 8,735,191 drop/s 0 redir/s
| cpu:3 0 pass/s 8,735,191 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s


Great, basically zero overhead. Awesome that you verified this!


This looks to be in the same range / noise level. In top I have
ksoftirqd at 100% and cpumap/./map at ~60%, so I hit the CPU speed
limit on a 10G link.

For our purpose of testing the XDP_REDIRECT code that you are
modifying, this is what we want: the RX CPU/NAPI is the bottleneck,
while the remote cpumap CPU has idle cycles (also indicated by the
208,054 sched stat).

perf top shows

I appreciate getting this perf data.

As we are explicitly dealing with splitting the workload across CPUs,
it is worth mentioning that perf supports displaying and filtering on
CPUs.

This perf command includes the CPU number (zero-indexed):
# perf report --sort cpu,comm,dso,symbol --no-children

For this benchmark, to keep the focus, I would reduce this to:
# perf report --sort cpu,symbol --no-children

The perf tool can also use -C to filter on specific CPUs, like:

# perf report --sort cpu,symbol --no-children -C 3,6


| 18.37% bpf_prog_4f0ffbb35139c187_cpumap_l4_hash [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash

This bpf_prog_4f0ffbb35139c187_cpumap_l4_hash runs on the RX CPU, doing
the load-balancing.
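
The actual BPF program ships with xdp-tools, but a minimal sketch of
the technique, hashing a flow to pick a cpumap slot and redirecting
there, looks roughly like this (map size and the hash are placeholders;
this is not the xdp-bench source):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_CPUMAP);
		__uint(max_entries, 64);
		__type(key, __u32);
		__type(value, struct bpf_cpumap_val);
	} cpu_map SEC(".maps");

	SEC("xdp")
	int cpumap_l4_hash_sketch(struct xdp_md *ctx)
	{
		/* Placeholder: a real program parses the IP/L4 headers
		 * and hashes the 5-tuple; rx_queue_index merely stands
		 * in for that hash here. */
		__u32 hash = ctx->rx_queue_index;

		return bpf_redirect_map(&cpu_map, hash % 64, XDP_PASS);
	}

	char _license[] SEC("license") = "GPL";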

| 13.15% [kernel] [k] cpu_map_kthread_run

This runs on the remote cpumap CPU (in this case CPU 3).
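
It is the kthread that dequeues the frames the RX CPU enqueued. With
--remote-action drop, xdp-bench additionally attaches an XDP program to
the cpumap entries, which executes in this kthread. A sketch of what
such a program amounts to (not the actual xdp-bench code):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	/* Runs on the remote CPU for each frame delivered through the
	 * cpumap; dropping everything isolates the handoff cost. */
	SEC("xdp/cpumap")
	int cpumap_remote_drop(struct xdp_md *ctx)
	{
		return XDP_DROP;
	}

	char _license[] SEC("license") = "GPL";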

| 12.96% [kernel] [k] ixgbe_poll
| 6.78% [kernel] [k] page_frag_free

The page_frag_free call might run on the remote cpumap CPU, as that is
where the xdp_frame memory gets freed when the frame is dropped.

| 5.62% [kernel] [k] xdp_do_redirect

for the top 5. Is this something that looks reasonable?

Yes, except I had to guess how the workload was split between CPUs ;-)

Thanks for doing these benchmarks! :-)
--Jesper