Re: [PATCH v3 0/2] kstats: kernel metric collector

From: Luigi Rizzo
Date: Thu Feb 27 2020 - 05:31:18 EST


On Wed, Feb 26, 2020 at 3:11 PM Toke Høiland-Jørgensen <toke@xxxxxxxxxx> wrote:
>
> Luigi Rizzo <lrizzo@xxxxxxxxxx> writes:
>
> > - the runtime cost and complexity of hooking bpf code is still a bit
> > unclear to me. kretprobe or tracepoints are expensive, I suppose that
> > some lean hook replace register_kretprobe() may exist and the
> > difference from inline annotations would be marginal (we'd still need
> > to put in the hooks around the code we want to time, though, so it
> > wouldn't be a pure bpf solution). Any pointers to this are welcome;
> > Alexei mentioned fentry/fexit and bpf trampolines, but I haven't found
> > an example that lets me do something equivalent to kretprobe (take a
> > timestamp before and one after a function without explicit
> > instrumentation)
>
> As Alexei said, with fentry/fexit the overhead should be on par with
> your example. This functionality is pretty new, though, so I can
> understand why it's not obvious how to do things with it yet :)
>
> I think the best place to look is currently in selftests/bpf in the
> kernel sources. Grep for 'fexit' and 'fentry' in the progs/ subdir.
> test_overhead.c and kfree_skb.c seem to have some examples you may be
> able to work from.

Thank you for the precise reference, Toke.
I tweaked test_overhead.c to measure (using kstats) the cost of the various
hooks, and I can confirm that fentry and fexit are pretty fast. The following
table shows the p90 runtime of __set_task_comm() at low (100/s) and high
(>1M/s) call rates:

90th percentile of __set_task_comm() runtime, in ns
(accuracy: 30ns)

call rate   base  kprobe  kretprobe  tracepoint  fentry  fexit
100/sec      270     870       1220         500     400    450
>1M/s         60     120        210          90      70     80

For high-rate operation, the overhead of fentry and fexit is quite low,
even lower than tracepoints, and well below the clock's accuracy
(more detailed measurements indicate ~5ns for fentry, ~10ns for fexit).
At very low call rates there is an extra 150-200ns,
but that is expected due to the out-of-line code.

cheers
luigi