Re: [PATCH v2 1/3] perf/core: Add a tracepoint for perf sampling
From: Alexei Starovoitov
Date: Fri Aug 05 2016 - 01:24:32 EST
On Thu, Aug 04, 2016 at 09:13:16PM -0700, Brendan Gregg wrote:
> On Thu, Aug 4, 2016 at 6:43 PM, Alexei Starovoitov
> <alexei.starovoitov@xxxxxxxxx> wrote:
> > On Thu, Aug 04, 2016 at 04:28:53PM +0200, Peter Zijlstra wrote:
> >> On Wed, Aug 03, 2016 at 11:57:05AM -0700, Brendan Gregg wrote:
> >>
> >> > As for pmu tracepoints: if I were to instrument it (although I wasn't
> >> > planning to), I'd put a tracepoint in perf_event_overflow() called
> >> > "perf:perf_overflow", with the same arguments. That could then be used
> >> > for all PMU overflow events, without needing to add specific
> >> > tracepoints.
> >>
> >> Could we not teach BPF to replace event->overflow_handler and inject
> >> itself there?
> >>
> >> We don't currently have nice interfaces for doing that, but it should be
> >> possible to do I think. We already have the indirect function call, so
> >> injecting ourselves there has 0 overhead.
>
> Sounds like a good idea, especially for things like struct
> file_operations, so that we can statically instrument file system
> reads/writes with zero non-enabled overhead, and not worry about
> high-frequency workloads (>10M events/sec).
>
> These perf probes aren't high frequency, though, and the code is not
> normally in use, so overhead should be much less of a concern.
> Sampling at 999 Hertz * CPUs is as frequent as I'd go. And if the
> tracepoint code is still adding a memory read, a conditional, and a
> branch, that's not many instructions, especially considering the
> normal use case of these perf functions: creating records and writing
> them to a perf ring buffer, which perf then picks up in user space and
> either processes live or writes to perf.data, back to the file system,
> etc. It would be hard to benchmark the effect of adding a few
> instructions to that path (and any results may be more sensitive to
> cache line placement than to the instructions).
tracepoints are actually zero overhead already when disabled, via the
static-key mechanism.
I don't think Peter's objection to the tracepoint was about overhead.
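For reference, a disabled tracepoint call site boils down to roughly
this shape (names illustrative; see include/linux/tracepoint.h for the
real macros):

static inline void trace_foo(struct pt_regs *regs)
{
        /* with CONFIG_JUMP_LABEL the test below is patched to a NOP
         * while no probe is attached, so a disabled tracepoint adds
         * nothing to the fast path */
        if (static_key_false(&__tracepoint_foo.key))
                foo_call_probes(regs);  /* hypothetical: run registered probes */
}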
> The perf:perf_hrtimer probe point is also reading state midway
> through a function, so it's not quite as simple as wrapping the
> function pointer. I do like that idea, though, for things like
> struct file_operations.
>
> >
> > you're right, it all makes sense. I guess I was too lazy to look into
> > how to do it properly. Adding a tracepoint looked like a quick and
> > easy way to achieve the same thing.
> > As far as the API goes, the existing IOC_SET_BPF ioctl will probably
> > do too. Currently overflow_handler is set at event alloc time. If we
> > start changing it on the fly with an atomic xchg(), afaik things
> > shouldn't break, since each overflow_handler runs to completion and
> > doesn't change global state, right?
> >
>
> How would it be implemented? I was thinking of adding explicit
> wrappers, e.g.:
instead of adding a tracepoint to perf_swevent_hrtimer we can replace
the overflow_handler for that particular event with some form of bpf
wrapper (probably a new bpf program type). Then not only periodic
events but pmu events as well would trigger the bpf prog.
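Roughly like this sketch (the event->prog and orig_overflow_handler
fields are made up for illustration; nothing like them exists yet):

/* hypothetical wrapper installed as event->overflow_handler */
static void bpf_overflow_handler(struct perf_event *event,
                                 struct perf_sample_data *data,
                                 struct pt_regs *regs)
{
        struct bpf_prog *prog = READ_ONCE(event->prog);
        int ret = 1;

        if (prog)
                ret = BPF_PROG_RUN(prog, regs);
        if (ret)
                /* prog wants the default action: emit the sample */
                event->orig_overflow_handler(event, data, regs);
}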
So instead of the normal __perf_event_output() writing into the ring
buffer, a bpf prog would be called that can optionally write into a
different ring buffer via bpf_perf_event_output(). The question is
what to pass into the program to make the most use of it.
'struct pt_regs' is a done deal, but perf_sample_data we cannot pass
as-is, since it's kernel-internal. Probably something similar to the
__sk_buff mirror would be needed.
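Say, something like this (layout completely hypothetical, just to show
the analogy; the verifier would rewrite field accesses into loads from
the in-kernel structures, the way it does for __sk_buff):

/* hypothetical uapi mirror of the sample data */
struct bpf_perf_event_data {
        struct pt_regs regs;
        __u64 sample_period;
};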
Another nice benefit of going via the overflow_handler instead of a
tracepoint is that the exclude_idle, exclude_user, exclude_kernel
flags of the perf event will all magically work, and the program will
be event-specific, so two parallel 'perf record'-like sampling
sessions won't conflict.
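From user space the flow would then look just like attaching a prog to
a kprobe event does today. A sketch, assuming prog_fd came from an
earlier bpf(BPF_PROG_LOAD, ...) call with the new program type:

#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* sampling software event, 999 Hz per cpu */
struct perf_event_attr attr = {
        .size        = sizeof(struct perf_event_attr),
        .type        = PERF_TYPE_SOFTWARE,
        .config      = PERF_COUNT_SW_CPU_CLOCK,
        .freq        = 1,
        .sample_freq = 999,
};
int pfd = syscall(__NR_perf_event_open, &attr,
                  -1 /* pid */, 0 /* cpu */, -1, 0);

/* reuse the existing ioctl, same as for kprobes */
ioctl(pfd, PERF_EVENT_IOC_SET_BPF, prog_fd);
ioctl(pfd, PERF_EVENT_IOC_ENABLE, 0);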