Re: Re: [PATCH bpf-next] bpf: tcp: Improve bpf write tcp opt performance

From: Feng Zhou
Date: Wed Jun 05 2024 - 03:37:24 EST

Next message: Frank Wang: "Re: [PATCH v5 5/7] usb: typec: mux: Allow multiple mux_devs per mux"
Previous message: Alexandre TORGUE: "Re: [PATCH v2 3/3] arm64: dts: st: enable STM32 access controller for RCC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

在 2024/5/31 18:45, Jakub Sitnicki 写道:

On Fri, May 17, 2024 at 03:27 PM +08, Feng Zhou wrote:

在 2024/5/17 01:15, Jakub Sitnicki 写道:

On Thu, May 16, 2024 at 11:15 AM +08, Feng Zhou wrote:

在 2024/5/15 17:48, Jakub Sitnicki 写道:

[...]

If it's not the BPF prog, which you have ruled out, then where are we
burining cycles? Maybe that is something that can be improved.
Also, in terms on quantifying the improvement - it is 20% in terms of
what? Throughput, pps, cycles? And was that a single data point? For
multiple measurements there must be some variance (+/- X pp).
Would be great to see some data to back it up.
[...]

Pressure measurement method:

server: sockperf sr --tcp -i x.x.x.x -p 7654 --daemonize
client: taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30

Default mode, no bpf prog:

taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
sockperf: == version #3.10-23.gited92afb185e6 ==
sockperf[CLIENT] send on:
[ 0] IP = x.x.x.x PORT = 7654 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: Total of 71520808 messages sent in 30.000 sec

sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
throughput consider using --msg-size=1472
sockperf: Summary: Message Rate is 2384000 [msg/sec]
sockperf: Summary: BandWidth is 2728.271 MBps (21826.172 Mbps)

perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
perf report

80.88%--sock_sendmsg
79.53%--tcp_sendmsg
42.48%--tcp_sendmsg_locked
16.23%--_copy_from_iter
4.24%--tcp_send_mss
3.25%--tcp_current_mss

perf top -C 8

19.13% [kernel] [k] _raw_spin_lock_bh
11.75% [kernel] [k] copy_user_enhanced_fast_string
9.86% [kernel] [k] tcp_sendmsg_locked
4.44% sockperf [.]
_Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
4.16% libpthread-2.28.so [.] __libc_sendto
3.85% [kernel] [k] syscall_return_via_sysret
2.70% [kernel] [k] _copy_from_iter
2.48% [kernel] [k] entry_SYSCALL_64
2.33% [kernel] [k] native_queued_spin_lock_slowpath
1.89% [kernel] [k] __virt_addr_valid
1.77% [kernel] [k] __check_object_size
1.75% [kernel] [k] __sys_sendto
1.74% [kernel] [k] entry_SYSCALL_64_after_hwframe
1.42% [kernel] [k] __fget_light
1.28% [kernel] [k] tcp_push
1.01% [kernel] [k] tcp_established_options
0.97% [kernel] [k] tcp_send_mss
0.94% [kernel] [k] syscall_exit_to_user_mode_prepare
0.94% [kernel] [k] tcp_sendmsg
0.86% [kernel] [k] tcp_current_mss

Having bpf prog to write tcp opt in all pkts:

taskset -c 8 sockperf tp --tcp -i x.x.x.x -p 7654 -m 1200 -t 30
sockperf: == version #3.10-23.gited92afb185e6 ==
sockperf[CLIENT] send on:
[ 0] IP = x.x.x.x PORT = 7654 # TCP
sockperf: Warmup stage (sending a few dummy messages)...
sockperf: Starting test...
sockperf: Test end (interrupted by timer)
sockperf: Test ended
sockperf: Total of 60636218 messages sent in 30.000 sec

sockperf: NOTE: test was performed, using msg-size=1200. For getting maximum
throughput consider using --msg-size=1472
sockperf: Summary: Message Rate is 2021185 [msg/sec]
sockperf: Summary: BandWidth is 2313.063 MBps (18504.501 Mbps)

perf record --call-graph fp -e cycles:k -C 8 -- sleep 10
perf report

80.30%--sock_sendmsg
79.02%--tcp_sendmsg
54.14%--tcp_sendmsg_locked
12.82%--_copy_from_iter
12.51%--tcp_send_mss
11.77%--tcp_current_mss
10.10%--tcp_established_options
8.75%--bpf_skops_hdr_opt_len.isra.54
5.71%--__cgroup_bpf_run_filter_sock_ops
3.32%--bpf_prog_e7ccbf819f5be0d0_tcpopt
6.61%--__tcp_push_pending_frames
6.60%--tcp_write_xmit
5.89%--__tcp_transmit_skb

perf top -C 8

10.98% [kernel] [k] _raw_spin_lock_bh
9.04% [kernel] [k] copy_user_enhanced_fast_string
7.78% [kernel] [k] tcp_sendmsg_locked
3.91% sockperf [.]
_Z14client_handlerI10IoRecvfrom9SwitchOff13PongModeNeverEviii
3.46% libpthread-2.28.so [.] __libc_sendto
3.35% [kernel] [k] syscall_return_via_sysret
2.86% [kernel] [k] bpf_skops_hdr_opt_len.isra.54
2.16% [kernel] [k] __htab_map_lookup_elem
2.11% [kernel] [k] _copy_from_iter
2.09% [kernel] [k] entry_SYSCALL_64
1.97% [kernel] [k] __virt_addr_valid
1.95% [kernel] [k] __cgroup_bpf_run_filter_sock_ops
1.95% [kernel] [k] lookup_nulls_elem_raw
1.89% [kernel] [k] __fget_light
1.42% [kernel] [k] __sys_sendto
1.41% [kernel] [k] entry_SYSCALL_64_after_hwframe
1.31% [kernel] [k] native_queued_spin_lock_slowpath
1.22% [kernel] [k] __check_object_size
1.18% [kernel] [k] tcp_established_options
1.04% bpf_prog_e7ccbf819f5be0d0_tcpopt [k] bpf_prog_e7ccbf819f5be0d0_tcpopt

Compare the above test results, fill up a CPU, you can find that
the upper limit of qps or BandWidth has a loss of nearly 18-20%.
Then CPU occupancy, you can find that "tcp_send_mss" has increased
significantly.

This helps prove the point, but what I actually had in mind is to check
"perf annotate bpf_skops_hdr_opt_len" and see if there any low hanging
fruit there which we can optimize.

For instance, when I benchmark it in a VM, I see we're spending cycles
mostly memset()/rep stos. I have no idea where the cycles are spent in
your case.

How do you do your pressure test? Can you send it to me for a try? Or you can try my pressure test method. Have you checked the calling frequency of bpf_skops_hdr_opt_len and bpf_skops_write_hdr_opt?

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 90706a47f6ff..f2092de1f432 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -6892,8 +6892,14 @@ enum {
* options first before the BPF program does.
*/
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG = (1<<6),
+ /* Fast path to reserve space in a skb under
+ * sock_ops->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB.
+ * opt length doesn't change often, so it can save in the tcp_sock. And
+ * set BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG to no bpf call.
+ */
+ BPF_SOCK_OPS_HDR_OPT_LEN_CACHE_CB_FLAG = (1<<7),

Have you considered a bpf_reserve_hdr_opt() flag instead?
An example or test coverage would to show this API extension in action
would help.

bpf_reserve_hdr_opt () flag can't finish this. I want to optimize
that bpf prog will not be triggered frequently before TSO. Provide
a way for users to not trigger bpf prog when opt len is unchanged.
Then when writing opt, if len changes, clear the flag, and then
change opt len in the next package.

I haven't seen a sample using the API extenstion that you're proposing,
so I can only guess. But you probably have something like:
SEC("sockops")
int sockops_prog(struct bpf_sock_ops *ctx)
{
if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
bpf_reserve_hdr_opt(ctx, N, 0);
bpf_sock_ops_cb_flags_set(ctx,
ctx->bpf_sock_ops_cb_flags |
MY_NEW_FLAG);
return 1;
}
}

Yes, that's what I expected.

I don't understand why you're saying it can't be transformed into:
int sockops_prog(struct bpf_sock_ops *ctx)
{
if (ctx->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB &&
ctx->args[0] == BPF_WRITE_HDR_TCP_CURRENT_MSS) {
bpf_reserve_hdr_opt(ctx, N, MY_NEW_FLAG);
return 1;
}
}

"bpf_reserve_hdr_opt (ctx, N, MY_NEW_FLAG);"

I don't know what I can do to pass the flag parameter, let
"bpf_reserve_hdr_opt" return quickly? But this is not useful,
because the loss caused by the triggering of bpf prog is very
expensive, and it is still on the hotspot function of sending
packets, and the TSO has not been completed yet.

[...]

This is not what I'm suggesting.

bpf_reserve_hdr_opt() has access to bpf_sock_ops_kern and even the
sock. You could either signal through bpf_sock_ops_kern to
bpf_skops_hdr_opt_len() that it should not be called again

Or even configure the tcp_sock directly from bpf_reserve_hdr_opt()
because it has access to it via bpf_sock_ops_kern{}.sk.

Oh, I see what you mean, this will achieve the goal.

Next message: Frank Wang: "Re: [PATCH v5 5/7] usb: typec: mux: Allow multiple mux_devs per mux"
Previous message: Alexandre TORGUE: "Re: [PATCH v2 3/3] arm64: dts: st: enable STM32 access controller for RCC"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]