Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

From: Andrii Nakryiko
Date: Tue Apr 30 2024 - 12:30:03 EST


On Tue, Apr 30, 2024 at 6:32 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
>
> On Mon, 29 Apr 2024 13:25:04 -0700
> Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
>
> > On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, 25 Apr 2024 13:31:53 -0700
> > > Andrii Nakryiko <andrii.nakryiko@xxxxxxxxx> wrote:
> > >
> > > > Hey Masami,
> > > >
> > > > I can't really review most of that code as I'm completely unfamiliar
> > > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > > a few comments where there were somewhat more obvious BPF-related
> > > > pieces.
> > > >
> > > > But I also ran our BPF benchmarks on probes/for-next as a baseline
> > > > and then with your series applied on top, just to see if there are any
> > > > regressions. I think it will be a useful data point for you.
> > >
> > > Thanks for testing!
> > >
> > > >
> > > > You should already be familiar with the bench tool we have in BPF
> > > > selftests (I used it on some other patches for your tree).
> > >
> > > What patches do we need?
> > >
> >
> > You mean for this `bench` tool? They are part of BPF selftests (under
> > tools/testing/selftests/bpf), you can build them by running:
> >
> > $ make RELEASE=1 -j$(nproc) bench
> >
> > After that you'll get a self-contained `bench` binary, which has all
> > the benchmarks built in.
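Side note: you can also invoke the bench binary directly; a sketch,
assuming the trigger benchmark names in current bpf-next (they may
differ in other trees):

$ ./bench -w2 -d5 trig-kretprobe

where -w sets the warmup time in seconds and -d the measurement
duration.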
> >
> > You might also find a small script (benchs/run_bench_trigger.sh inside
> > the BPF selftests directory) helpful; it collects a final summary of the
> > benchmark run and optionally accepts a specific set of benchmarks to run.
> > So you can use it like this:
> >
> > $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> > kprobe : 18.731 ± 0.639M/s
> > kprobe-multi : 23.938 ± 0.612M/s
> >
> > By default it will run a wider set of benchmarks (no uprobes, but a
> > bunch of extra fentry/fexit tests and stuff like this).
>
> original:
> # benchs/run_bench_trigger.sh
> kretprobe : 1.329 ± 0.007M/s
> kretprobe-multi: 1.341 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.288 ± 0.014M/s
> kretprobe-multi: 1.365 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.329 ± 0.002M/s
> kretprobe-multi: 1.331 ± 0.011M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.311 ± 0.003M/s
> kretprobe-multi: 1.318 ± 0.002M/s
>
> patched:
>
> # benchs/run_bench_trigger.sh
> kretprobe : 1.274 ± 0.003M/s
> kretprobe-multi: 1.397 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.307 ± 0.002M/s
> kretprobe-multi: 1.406 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.279 ± 0.004M/s
> kretprobe-multi: 1.330 ± 0.014M/s
> # benchs/run_bench_trigger.sh
> kretprobe : 1.256 ± 0.010M/s
> kretprobe-multi: 1.412 ± 0.003M/s
>
> Hmm, in my case, the differences seem smaller (~3%?).
> I attached perf report results for both, but I don't see a large difference.

I ran my benchmarks on a bare-metal machine (and quite a powerful one
at that; you can see my numbers are almost 10x yours), with mitigations
disabled, no retpolines, etc. If you have any of those mitigations
enabled, that would probably result in smaller relative differences. If
you are running inside QEMU/VM, the results might differ significantly
as well.
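For reference, one quick way to see which mitigations are active (this
is the standard sysfs interface, present on any recent kernel):

$ cat /sys/devices/system/cpu/vulnerabilities/*

Booting with mitigations=off on the kernel command line disables most
of them in one go.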

>
> > > >
> > > > BASELINE
> > > > ========
> > > > kprobe : 24.634 ± 0.205M/s
> > > > kprobe-multi : 28.898 ± 0.531M/s
> > > > kretprobe : 10.478 ± 0.015M/s
> > > > kretprobe-multi: 11.012 ± 0.063M/s
> > > >
> > > > THIS PATCH SET ON TOP
> > > > =====================
> > > > kprobe : 25.144 ± 0.027M/s (+2%)
> > > > kprobe-multi : 28.909 ± 0.074M/s
> > > > kretprobe : 9.482 ± 0.008M/s (-9.5%)
> > > > kretprobe-multi: 13.688 ± 0.027M/s (+24%)
> > >
> > > This looks good. Kretprobe should also move to the fprobe-based path
> > > used by kretprobe-multi eventually, because kretprobe is essentially
> > > a single-callback version of kretprobe-multi.
>
> I ran another benchmark (a prctl loop, attached); the original kernel's result is here:
>
> # sh ./benchmark.sh
> count = 10000000, took 6.748133 sec
>
> And the patched kernel's result:
>
> # sh ./benchmark.sh
> count = 10000000, took 6.644095 sec
>
> I confirmed that the perf result shows no big difference.
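
For readers without the attachment: I'd guess the prctl loop looks
roughly like the sketch below (my assumption, not the actual
attachment; prctl() is a conveniently cheap syscall to hang a probe
on):

#include <stdio.h>
#include <time.h>
#include <sys/prctl.h>

int main(void)
{
	struct timespec start, end;
	const long count = 10000000;
	char name[16];

	clock_gettime(CLOCK_MONOTONIC, &start);
	for (long i = 0; i < count; i++)
		prctl(PR_GET_NAME, name, 0, 0, 0); /* each call enters sys_prctl, firing the probe */
	clock_gettime(CLOCK_MONOTONIC, &end);

	printf("count = %ld, took %f sec\n", count,
	       (end.tv_sec - start.tv_sec) +
	       (end.tv_nsec - start.tv_nsec) / 1e9);
	return 0;
}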
>
> Thank you,
>
>
> > >
> > > >
> > > > These numbers are pretty stable and look to be more or less representative.
> > > >
> > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > > > about the same, though.
> > > >
> > > > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > > > slower, by almost 10%. Not sure why, but it looks real after re-running
> > > > the benchmarks a bunch of times and getting stable results.
> > >
> > > Hmm, kretprobe on x86 should still use ftrace + rethook even with my
> > > series, so nothing should have changed. Maybe the cache access pattern
> > > has changed?
> > > I'll check it with tracefs (to rule out effects from the bpf-related changes)
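
(For reference, a tracefs-only reproduction could use the standard
kprobe_events interface; the probed function here is just an arbitrary
example:

# cd /sys/kernel/tracing
# echo 'r:myretprobe vfs_read' >> kprobe_events
# echo 1 > events/kprobes/myretprobe/enable

That attaches a return probe with no BPF involved.)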
> > >
> > > >
> > > > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > > > Again, I don't know if it is expected or not, but it's a nice
> > > > improvement.
> > >
> > > Thanks!
> > >
> > > >
> > > > If you have any idea why kretprobes would get so much slower, it would
> > > > be nice to look into that and see if you can mitigate the regression
> > > > somehow. Thanks!
> > >
> > > OK, let me check it.
> > >
> > > Thank you!
> > >
> > > >
> > > >
> > > > > 51 files changed, 2325 insertions(+), 882 deletions(-)
> > > > > create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> > > > >
> > > > > --
> > > > > Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
> > > > >
> > >
> > >
> > > --
> > > Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>
>
>
> --
> Masami Hiramatsu (Google) <mhiramat@xxxxxxxxxx>