Re: [PATCH bpf-next v1] selftests/bpf: Add benchmark for tail call performance

From: bot+bpf-ci

Date: Mon Jun 29 2026 - 23:12:48 EST

> diff --git a/tools/testing/selftests/bpf/benchs/run_bench_tailcall.sh b/tools/testing/selftests/bpf/benchs/run_bench_tailcall.sh
> new file mode 100755
> index 000000000000..c687f34455e8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/benchs/run_bench_tailcall.sh
> @@ -0,0 +1,18 @@
> +#!/bin/bash
> +# SPDX-License-Identifier: GPL-2.0
> +
> +# 1. Load the official common benchmark utilities
> +source ./benchs/run_common.sh
> +
> +# 2. Strict error handling configurations
> +set -eufo pipefail
> +
> +# 3. Use default bench binary path if not exported by the framework
> +BENCH_BIN=${BENCH:-./bench}
> +
> +# 4. Run with strict core affinity and isolation for reliable profiling
> +RUN_BENCH="numactl --physcpubind=0,2 --membind=0 nice -n -20 $BENCH_BIN -w5 -d20 -a"
> +
> +# 5. Capture the output string and pass it straight into summarize_ops
> +# This satisfies the framework's internal parameter bounds without triggering set -u.
> +summarize_ops "tailcall" "$($RUN_BENCH tailcall)"

This isn't a bug, but could the numbered step comments (# 1. .. # 5.) be
trimmed? Several of them restate the line right below them, for example:

# 1. Load the official common benchmark utilities
source ./benchs/run_common.sh

The neighbouring run_bench_*.sh scripts here tend to be short, with little
commentary, so the step-by-step narration stands out. Keeping only the
comments that explain a non-obvious choice (such as the cpu-affinity
flags) might fit the directory better.
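
As a rough illustration of that suggestion (a sketch, not a tested
replacement), the script might shrink to something like the heredoc body
below. The commands and flags are copied unchanged from the patch; the
wording of the one remaining comment is only an assumption about what is
worth explaining. The wrapper writes the script to a temporary file and
syntax-checks it with bash -n, so it can be tried outside the selftests
tree without numactl or the bench binary; nothing is actually executed.

```shell
#!/bin/bash
# Sketch only: emit a trimmed variant of run_bench_tailcall.sh and check
# that it parses. The benchmark itself is never run here.
trimmed=$(mktemp)

cat > "$trimmed" <<'EOF'
#!/bin/bash
# SPDX-License-Identifier: GPL-2.0

source ./benchs/run_common.sh

set -eufo pipefail

BENCH_BIN=${BENCH:-./bench}

# Pin the run to fixed CPUs and one memory node, at raised priority, to
# reduce run-to-run noise.
RUN_BENCH="numactl --physcpubind=0,2 --membind=0 nice -n -20 $BENCH_BIN -w5 -d20 -a"

summarize_ops "tailcall" "$($RUN_BENCH tailcall)"
EOF

# bash -n parses the file without executing any command in it.
if bash -n "$trimmed"; then
	result="syntax ok"
else
	result="syntax error"
fi
echo "$result"

rm -f "$trimmed"
```

Only the comment on the numactl/nice line is kept, since that is the one
place where the reason behind the choice is not visible from the command
itself.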

> diff --git a/tools/testing/selftests/bpf/progs/tailcall_bench.c b/tools/testing/selftests/bpf/progs/tailcall_bench.c
> new file mode 100644
> index 000000000000..68a50c7b1d06
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/tailcall_bench.c

[ ... ]

> +/* High-performance Per-CPU Array Map to eliminate global lock variance */
> +struct {
> + __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
> + __uint(max_entries, 1);
> + __type(key, __u32);
> + __type(value, __u64);
> +} pcpu_hits_map SEC(".maps");

[ ... ]

> +static __always_inline void increment_pcpu_counter(void)
> +{
> + __u32 key = 0;
> + __u64 *val;
> +
> + val = bpf_map_lookup_elem(&pcpu_hits_map, &key);
> + if (val) {
> + /* Lockless and stable per-CPU increment without cacheline bounce */
> + (*val)++;
> + }
> +}

This isn't a bug, but do the "high-performance / lockless / no cacheline
bounce" comments add anything beyond what the per-CPU map type already
conveys? The increment comment in particular sits directly above a plain
(*val)++ statement. Could these be dropped?

The commit message raises a similar question. Quoting it in full:

> Add a dedicated benchmark to measure the runtime performance and overhead
> of tail calls. This helps developers detect performance regressions across
> different kernel versions and optimization phases.
>
> The benchmark sets up a standard tracepoint to intercept syscalls triggered
> by a dedicated producer thread running a dead loop. The execution path is
> strictly bounded by the tail call depth limit, safely preventing any core
> lockup or infinite recursion risks.
>
> To eliminate cacheline bouncing and global locking variance, Per-CPU array
> maps are utilized to track execution hits across multiple cores.
>
> To evaluate the JIT compiler architecture under complex control flows, it
> interleaves direct tail calls with bpf2bpf tail calls.
>
> This forces the tracking context at the target program's entry prologue to
> toggle dynamically between a scalar count (0 to 33) and a massive kernel
> pointer address, providing a robust micro-architectural stress test which
> consists of:
>
> 1. tailcall_bench_main: The entry program filtering processes by PID, and
> introducing a high-frequency alternating execution path via the syscall
> arguments to switch between a direct tail call and a bpf2bpf tail call.
> 2. tailcall_bench_target: The final target destination hop which safely
> terminates the mixed execution flow and increments the step counter.
>
> All functions utilize explicit "struct tracepoint_raw_syscalls_sys_enter"
> context types to ensure strict type alignment and clear pointer provenance
> for the BPF verifier.
>
> Additionally, provide a test script run_bench_tailcall.sh to automate the
> execution under strict core affinity and isolation for reliable profiling,
> formatting the captured metrics directly into the performance report.
>
> Signed-off-by: Tiezhu Yang <yangtiezhu@xxxxxxxxxxx>

This isn't a bug, but could the changelog focus more on why a dedicated
tail-call throughput benchmark is needed over the existing infrastructure?

The text mostly walks through what each program does, and uses phrasing like
"robust micro-architectural stress test", "massive kernel pointer address",
and "strict type alignment and clear pointer provenance". A reviewer comes
away knowing what the two programs are, but not why the benchmark is
designed this way or what the existing benchmarks fail to cover.

---
AI reviewed your patch. Please fix the bug or reply by email explaining why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md

CI run summary: https://github.com/kernel-patches/bpf/actions/runs/28416687457