Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event
From: Wangnan (F)
Date: Wed Jul 01 2015 - 02:23:01 EST
On 2015/7/1 13:44, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 02:57:30AM +0000, He Kuang wrote:
>> This patch adds an extra perf trace buffer for other utilities like
>> bpf to fill extra data to perf events.
> What!, why?
The goal of this patchset is to give BPF programs a means to output
data through perf samples.
BPF programs give us a way to filter and aggregate events, which lets
us do many interesting things. For example, we can count the number of
context switches that occur during sys_write system calls by attaching
BPF programs to the entry and exit points of the system call and to the
entry of __schedule, then reporting the count when the system call
exits. Combined with BPF programs reading PMU counters, which we are
working on, BPF programs can be used to profile kernel functions in a
fine-grained manner.
However, currently the only ways that BPF programs can pass
information to perf are:
1. By returning 0 or 1, a BPF program can prevent perf from collecting
   a sample;
2. Through the map mechanism, user programs (perf) can read the
   aggregation result computed by a BPF program (not implemented yet);
3. Through BPF_FUNC_trace_printk, they can output strings into the
   ftrace ring buffer.
For the task I mentioned above, the best way to do it today is to print
the results into the ring buffer from the program attached to
sys_write%return, and then merge them with perf.data using timestamps.
We believe this can be improved. These patches are an attempt to allow
BPF programs to call something like 'BPF_FUNC_output_sample' to output
data, which is then collected together with the other data output by a
perf sample. Once perf support is added (not implemented yet), perf
will be able to extract that data through 'perf script' or 'perf data
convert --to-ctf', and further analysis can then be done.
The extra perf trace buffer is added for that reason. Currently,
perf_trace_buf is used as a per-CPU buffer for the other parts of the
perf sample data. Letting BPF programs append information to that
buffer is possible, but it requires us to calculate the data size a
perf sample needs (by calling __get_data_size) even before we know
whether the sample will be filtered out. Alternatively, we could let
BPF programs write from the beginning of that buffer and append the
perf sample data after it. However, the current perf would then be
unable to parse those samples.