Hi Alexei and Wang,
On Thu, May 28, 2015 at 08:35:19PM -0700, Alexei Starovoitov wrote:
On Thu, May 28, 2015 at 03:14:44PM +0800, Wangnan (F) wrote:
Looks very interesting and useful indeed!
On 2015/5/28 14:09, Alexei Starovoitov wrote:
On Thu, May 28, 2015 at 11:09:50AM +0800, Wangnan (F) wrote:
For me, enabling eBPF programs to read PMU counters is the first thing that
needs to be done.
The other thing is enabling eBPF programs to bring some information to the
perf sample.
Here is an example to show my idea.
I have a program which does:

int main()
{
	while (1) {
		read(...);
		/* do A */
		write(...);
		/* do B */
	}
}

Then, by using the following script:

SEC("enter=sys_write $outdata:u64")
int enter_sys_write(...) {
	u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
	bpf_store_value(cycles_cnt);
	return 1;
}
SEC("enter=sys_read $outdata:u64")
int enter_sys_read(...) {
	u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
	bpf_store_value(cycles_cnt);
	return 1;
}

with 'perf script' we can check the cycle counter at each point, and then
compute the number of cycles between any two sampling points. This way we
can compute how many cycles are taken by A and B. If the instruction counter
is also recorded, we will know the IPC of A and B.

Agree. That's useful. That's exactly what I meant by
"compute a number of cache misses between two kprobe events".
The overhead is lower when the bpf program computes the cycle and instruction
deltas, computes IPC and passes only the final IPC numbers to user space.
It can even average IPC over time.
For some very frequent events it can read cycle_cnt on sys_entry_read,
then read it on sys_exit_read, compute the delta and average it into the map.
User space can read the map every second or every 10 seconds and print
a nice graph.
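
The entry/exit delta scheme described above can be sketched in plain C. This is only a user-space illustration under stated assumptions: `struct ipc_slot` stands in for one BPF map value, and `on_entry`/`on_exit` stand in for the two probe handlers; none of these names come from the kernel or from this thread. The counter values are passed in as arguments because how the program would actually read the PMU is still under discussion here.

```c
#include <stdint.h>

/* Hypothetical stand-in for one BPF map value: the probes update it,
 * user space reads it periodically. */
struct ipc_slot {
	uint64_t start_cycles;	/* counter snapshots taken on entry */
	uint64_t start_insns;
	uint64_t total_cycles;	/* accumulated entry->exit deltas */
	uint64_t total_insns;
	uint64_t samples;	/* number of completed entry/exit pairs */
};

/* What a probe on syscall entry would do: snapshot the counters. */
static void on_entry(struct ipc_slot *s, uint64_t cycles, uint64_t insns)
{
	s->start_cycles = cycles;
	s->start_insns = insns;
}

/* What a probe on syscall exit would do: fold the deltas into the slot.
 * Unsigned subtraction stays correct across a counter wrap. */
static void on_exit(struct ipc_slot *s, uint64_t cycles, uint64_t insns)
{
	s->total_cycles += cycles - s->start_cycles;
	s->total_insns += insns - s->start_insns;
	s->samples++;
}

/* User-space side: average cycles per call, 0 if nothing was recorded. */
static uint64_t avg_cycles(const struct ipc_slot *s)
{
	return s->samples ? s->total_cycles / s->samples : 0;
}

/* IPC scaled by 1000 so that only integer arithmetic is needed
 * (eBPF programs cannot use floating point). */
static uint64_t ipc_x1000(const struct ipc_slot *s)
{
	return s->total_cycles ? s->total_insns * 1000 / s->total_cycles : 0;
}
```

Only the accumulated totals cross into user space, which is where the overhead saving over emitting one sample per event would come from.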
As far as 'bpf_store_value' goes... I was thinking to expose perf ring_buffer
to bpf programs, so that program can stream any data to perf that receives
it via mmap. Then you don't need this '$outdata' hack.

Then we need to define and pass the format of such data so that perf
tools can read and process the data. IIRC Masami suggested having an
additional user event type for inserting/injecting non-perf events -
like PERF_RECORD_USER_DEFINED_TYPE? Its contents would be something
similar to a tracepoint event format file so that we can reuse existing
code to parse the event definition.
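
To make the format idea a bit more concrete, here is a rough sketch of how a tool could decode a streamed payload from a per-field description carrying the same name/offset/size attributes that a tracepoint format file lists. Everything in it is an assumption for illustration: `struct field_desc` and `read_field` are made-up names, not existing perf code, and nothing here defines how such a record type would actually be laid out.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical field description, mirroring the "offset:" and "size:"
 * attributes of a tracepoint format file entry. */
struct field_desc {
	const char *name;
	uint16_t offset;	/* byte offset inside the payload */
	uint16_t size;		/* 1, 2, 4 or 8 bytes */
};

/* Read one unsigned field from a raw payload in host byte order.
 * Returns 0 on success, -1 if the size is unsupported or the field
 * would run past the end of the payload. */
static int read_field(const void *payload, size_t payload_len,
		      const struct field_desc *f, uint64_t *out)
{
	const unsigned char *p = (const unsigned char *)payload + f->offset;
	uint8_t v8;
	uint16_t v16;
	uint32_t v32;
	uint64_t v64;

	if ((size_t)f->offset + f->size > payload_len)
		return -1;

	switch (f->size) {
	case 1: memcpy(&v8, p, 1);  *out = v8;  return 0;
	case 2: memcpy(&v16, p, 2); *out = v16; return 0;
	case 4: memcpy(&v32, p, 4); *out = v32; return 0;
	case 8: memcpy(&v64, p, 8); *out = v64; return 0;
	default: return -1;
	}
}
```

With descriptions like these passed alongside the data, the tool would not need to know the layout chosen by a given bpf program in advance.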
Thanks,
Namhyung