Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

From: Wangnan (F)
Date: Wed Jul 01 2015 - 02:23:01 EST




On 2015/7/1 13:44, Peter Zijlstra wrote:
> On Wed, Jul 01, 2015 at 02:57:30AM +0000, He Kuang wrote:
>> This patch adds an extra perf trace buffer for other utilities like
>> bpf to fill extra data to perf events.
> What!, why?

The goal of this patchset is to give BPF programs a means to output data through
perf samples.

BPF programs give us a way to filter and aggregate events, which lets us do many
interesting things. For example, we can count the number of context switches in sys_write
system calls by attaching BPF programs to the entry and exit points of the system call
and to the entry of __schedule, then reading the count when the system call exits. Combined
with reading the PMU from BPF, which we are working on, BPF programs can be used to profile
kernel functions in a fine-grained manner.
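
To make the counting scheme concrete, here is a minimal user-space model in plain C. It
is not BPF code and does not use any kernel API: the three handlers stand in for the
programs attached to the sys_write entry, the __schedule entry and the sys_write exit,
and the fixed-size table stands in for a BPF map. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_TASKS 64

/* One slot per task currently inside sys_write (stand-in for a BPF map). */
struct slot {
	int      in_use;
	uint32_t pid;
	uint64_t switches;
};

static struct slot slots[MAX_TASKS];

static struct slot *find_slot(uint32_t pid)
{
	for (size_t i = 0; i < MAX_TASKS; i++)
		if (slots[i].in_use && slots[i].pid == pid)
			return &slots[i];
	return NULL;
}

/* Model of the program at the entry of sys_write: start tracking the task. */
void on_sys_write_entry(uint32_t pid)
{
	struct slot *s = find_slot(pid);

	for (size_t i = 0; i < MAX_TASKS && !s; i++)
		if (!slots[i].in_use)
			s = &slots[i];
	if (!s)
		return;		/* table full: this call is not tracked */
	s->in_use = 1;
	s->pid = pid;
	s->switches = 0;
}

/* Model of the program at the entry of __schedule: count only tracked tasks. */
void on_schedule_entry(uint32_t pid)
{
	struct slot *s = find_slot(pid);

	if (s)
		s->switches++;
}

/* Model of the program at the exit of sys_write: return the count and stop
 * tracking. Returns -1 if the task was not being tracked. */
int64_t on_sys_write_exit(uint32_t pid)
{
	struct slot *s = find_slot(pid);
	int64_t n;

	if (!s)
		return -1;
	n = (int64_t)s->switches;
	s->in_use = 0;
	return n;
}
```

The value returned by the exit handler is exactly the kind of per-call result that
currently has no good path into a perf sample.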

However, currently the only ways a BPF program can pass something to perf are:

1. By returning 0 rather than 1, a BPF program can prevent perf from collecting a sample;
2. Through the map mechanism, a user program (perf) can read the aggregated result
computed by the BPF program (not implemented yet);
3. Through BPF_FUNC_trace_printk, a BPF program can output a string into the ftrace ring buffer.

For the task I mentioned above, the best way to do it today is to print the results into the
ftrace ring buffer from the program attached to sys_write%return, and then merge that output
with perf.data using timestamps.
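
The merge step could look roughly like the sketch below. It assumes that both streams
are already sorted by timestamp and use the same clock, which is not guaranteed in
practice. The record layout and function name are hypothetical and are not part of perf
or ftrace.

```c
#include <stddef.h>
#include <stdint.h>

enum source { SRC_PERF = 0, SRC_FTRACE = 1 };

/* One entry of the merged stream: where the record came from and its
 * position in the original stream. */
struct record {
	uint64_t    ts;
	enum source src;
	size_t      index;
};

/* Merge two timestamp-sorted streams into 'out', which must have room for
 * n_perf + n_ftrace entries. On equal timestamps the perf record comes
 * first. Returns the number of entries written. */
size_t merge_by_timestamp(const uint64_t *perf_ts, size_t n_perf,
			  const uint64_t *ftrace_ts, size_t n_ftrace,
			  struct record *out)
{
	size_t i = 0, j = 0, k = 0;

	while (i < n_perf || j < n_ftrace) {
		int take_perf = (j >= n_ftrace) ||
				(i < n_perf && perf_ts[i] <= ftrace_ts[j]);

		if (take_perf) {
			out[k].ts = perf_ts[i];
			out[k].src = SRC_PERF;
			out[k].index = i++;
		} else {
			out[k].ts = ftrace_ts[j];
			out[k].src = SRC_FTRACE;
			out[k].index = j++;
		}
		k++;
	}
	return k;
}
```

Even when this works, the two outputs live in separate files and the association between
a BPF result and a perf sample is only as good as the timestamps.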

We believe this can be improved. These patches are an attempt to let BPF programs call something
like 'BPF_FUNC_output_sample' to output data, which is then collected together with the other
data output by a perf sample. With support in the perf tool (not implemented yet), perf will be
able to extract that data through 'perf script' or 'perf data convert --to-ctf'. Further
analysis can then be done on the result.

The extra perf trace buffer is added for that reason. Currently, we use perf_trace_buf as a
per-CPU buffer for the other parts of a perf sample's data. Making a BPF program append information
to that buffer is possible, but it requires us to calculate the data size a perf sample requires (by
calling __get_data_size) before we know whether the sample will be filtered out. Alternatively, we
could make the BPF program write from the beginning of that buffer and append the perf sample data
after it. However, the resulting samples could not be parsed by the current perf tool.
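
The idea of a separate buffer can be illustrated with a small user-space model. This is
only a sketch of the concept, not the code in the patches: the structure, the function
names, the buffer size and the choice to place the extra bytes after the regular payload
are all assumptions made for illustration. The point it shows is that the BPF side can
write without knowing the size of the regular sample data, and the data is simply
dropped if the sample is filtered out.

```c
#include <stddef.h>
#include <string.h>

#define EXTRA_BUF_SIZE 64

/* Stand-in for an extra per-CPU buffer that only the BPF program writes. */
struct extra_buf {
	unsigned char data[EXTRA_BUF_SIZE];
	size_t        len;
};

/* BPF side: append bytes to the staging buffer.
 * Returns 0 on success, -1 if the data does not fit. */
int extra_buf_append(struct extra_buf *eb, const void *src, size_t n)
{
	if (n > EXTRA_BUF_SIZE - eb->len)
		return -1;
	memcpy(eb->data + eb->len, src, n);
	eb->len += n;
	return 0;
}

/* Sample side: called once the filter decision ('keep') is known. If the
 * sample is kept and fits in 'out', the regular payload is copied first and
 * the staged bytes after it. The staging buffer is reset in every case.
 * Returns the number of bytes written, or 0 if nothing was emitted. */
size_t emit_sample(struct extra_buf *eb, int keep,
		   const void *payload, size_t payload_len,
		   unsigned char *out, size_t out_cap)
{
	size_t total = 0;

	if (keep && payload_len <= out_cap &&
	    eb->len <= out_cap - payload_len) {
		memcpy(out, payload, payload_len);
		memcpy(out + payload_len, eb->data, eb->len);
		total = payload_len + eb->len;
	}
	eb->len = 0;
	return total;
}
```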

