Re: [RFC PATCH 0/5] Make eBPF programs output data to perf event

From: Wangnan (F)
Date: Thu Jul 02 2015 - 05:26:35 EST




On 2015/7/2 11:52, Alexei Starovoitov wrote:
On 7/1/15 8:38 PM, He Kuang wrote:


On 2015/7/2 10:48, Alexei Starovoitov wrote:
On 7/1/15 4:58 AM, Peter Zijlstra wrote:

But why create a separate trace buffer, it should go into the regular
perf buffer.

+1

I think
+static char __percpu *perf_extra_trace_buf[PERF_NR_CONTEXTS];
is redundant.
It adds quite a bit of unnecessary complexity to the whole patch set.

Also the call to bpf_output_sample() is not effective unless program
returns 1. It's a confusing user interface.

Also you cannot ever do:
BPF_FUNC_probe_read,
+ BPF_FUNC_output_sample,
BPF_FUNC_ktime_get_ns,
new functions must be added to the end.

Why not just do:
perf_trace_buf_prepare() + perf_trace_buf_submit() from the helper?
No changes to current code.
No need to call __get_data_size() and other overhead.
The helper can be called multiple times from the same program.
imo much cleaner.


Invoke perf_trace_buf_submit() will generate a second perf
event (header->type = PERF_RECORD_SAMPLE) entry which is
different from the event entry outputed by the orignial
kprobe. So the final result of the example in 00/00 patch may
like this:

sample entry 1(from bpf_prog):
comm timestamp1 generic_perform_write pmu_value=0x1234
sample entry 2(from original kprobe):
comm timestamp2 generic_perform_write: (ffffffff81140b60)
Compared with current implementation:
combined sample entry:
comm timestamp generic_perform_write: (ffffffff81140b60)
pmu_value=0x1234

The former two entries may be discontinuous as there are multiple
threads and kprobes to be recorded, and there's a chance that one
entry is missed but the other is recorded. What we need is the
pmu_value read when 'generic_perform_write' enters, the two
entries result is not intuitive enough and userspace tools have
to do the work to find and combine those two sample entries to
get the result.

Just change your example to return 0 and user space will see
one sample.


Yes, by using perf_trace_buf_prepare() + perf_trace_buf_submit() in
helper function and let bpf program always returns 0 we can make data
collected by BPF programs output into samples, if following problems
are solved:

1. In bpf program there's no way to get 'struct perf_event' or 'struct
ftrace_event_call'. We have to deduce them through pt_regs:

pt_regs -> ip -> kprobe -> struct trace_kprobe -> struct
ftrace_event_call -> hlist_entry -> struct perf_event

Which seems dirty, but without that we can't call
perf_trace_buf_submit().

2. Even if we finally get 'struct perf_event', I'm not sure whether
user really concern on it. If we really concern on all information
output through perf_trace_buf_submit() like callstack and
register, why not make bpf program return non-zero instead? But then
we have to consider how to connect two samples together.

So maybe writing a new function to replace perf_trace_buf_submit() and
output some light-weight information instead of full event data is
worth considering. Otherwise, maybe a dummy 'struct perf_event' for BPF
outputing is also acceptable?

What we are trying to do in previous patches is to merge data output by
BPF programs and original data output by perf_trace_buf_submit()
together. For example (expressed in CTF metadata format):

event.header := struct { // both output by perf_trace_buf_submit()
integer { ... } id;
integer { ... } timestamp;
}
event {
name = "perf_bpf_probe:lock_page";
...
fields := struct {
integer { ... } perf_ip; // perf_trace_buf_submit()
integer { ... } perf_tid; // perf_trace_buf_submit()
...
integer { ... } page; <-- Fetched using prologue
integer { ... } cycle_cmu_counter; <-- Output by BPF program
}
}

We believe that implemented should be simpler. Whether to use an extra
perf_trace_buf or not can be discussed. We have other choices. For
example, we can make BPF program write its data from the end of
bpf_trace_buf, and connect two parts of output data before calling
perf_trace_buf_submit().

Thank you.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/