Re: [PATCH] perf/core: Add a tracepoint for perf sampling

From: Brendan Gregg
Date: Tue Aug 02 2016 - 22:51:11 EST


On Fri, Jul 29, 2016 at 8:34 PM, Wangnan (F) <wangnan0@xxxxxxxxxx> wrote:
>
>
> On 2016/7/30 2:05, Brendan Gregg wrote:
>>
>> On Tue, Jul 19, 2016 at 4:20 PM, Brendan Gregg <bgregg@xxxxxxxxxxx> wrote:
>>>
>>> When perf is performing hrtimer-based sampling, this tracepoint can be used
>>> by BPF to run additional logic on each sample. For example, BPF can fetch
>>> stack traces and frequency count them in kernel context, for an efficient
>>> profiler.
>>
>> Any comments on this patch? Thanks,
>>
>> Brendan
>
>
> Sorry for the late reply.
>
> I think it is a useful feature. Could you please provide an example
> to show how to use it in perf?

Yes. The following example samples at 999 Hertz and emits the
instruction pointer only when it falls within a custom address range,
as checked by BPF:

# ./perf record -e bpf-output/no-inherit,name=evt/ \
    -e ./sampleip_range.c/map:channel.event=evt/ \
    -a ./perf record -F 999 -e cpu-clock -N -a -o /dev/null sleep 5
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.000 MB /dev/null ]
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.134 MB perf.data (222 samples) ]

# ./perf script -F comm,pid,time,bpf-output
'bpf-output' not valid for hardware events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
'bpf-output' not valid for unknown events. Ignoring.
dd 6501 3058.117379:
BPF output: 0000: 3c 4c 21 81 ff ff ff ff <L!.....
0008: 00 00 00 00 ....

dd 6501 3058.130392:
BPF output: 0000: 55 4c 21 81 ff ff ff ff UL!.....
0008: 00 00 00 00 ....

dd 6501 3058.131393:
BPF output: 0000: 55 4c 21 81 ff ff ff ff UL!.....
0008: 00 00 00 00 ....

dd 6501 3058.149411:
BPF output: 0000: e1 4b 21 81 ff ff ff ff .K!.....
0008: 00 00 00 00 ....

dd 6501 3058.155417:
BPF output: 0000: 76 4c 21 81 ff ff ff ff vL!.....
0008: 00 00 00 00 ....

In that example, the outer perf loads the BPF program and records its
filtered output, while the inner perf only configures the sampling (its
own samples go to /dev/null). We can certainly improve how this works.
It will be much more interesting once perf can emit maps, and a perf
BPF program can populate a map.
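
To illustrate what "populating a map" could buy us, here is a rough
user-space sketch of the aggregation step: counting hits per
instruction pointer instead of emitting every sample. This is plain C,
not BPF; the names ip_table, ip_table_add, ip_table_get and SLOTS are
made up for illustration. In a real BPF program I'd assume a
BPF_MAP_TYPE_HASH updated from the tracepoint handler would play this
role.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64	/* hypothetical table size, illustration only */

struct ip_count {
	uint64_t ip;
	uint64_t count;		/* 0 means the slot is empty */
};

struct ip_table {
	struct ip_count slot[SLOTS];	/* zero-initialise before use */
};

/* Map an address to a starting slot in 0..SLOTS-1 (top 6 bits of a multiplicative hash). */
static size_t ip_hash(uint64_t ip)
{
	return (size_t)((ip * 0x9e3779b97f4a7c15ULL) >> 58);
}

/* Count one sample for ip. Returns 0 on success, -1 if the table is full. */
static int ip_table_add(struct ip_table *t, uint64_t ip)
{
	size_t start = ip_hash(ip);

	for (size_t i = 0; i < SLOTS; i++) {
		struct ip_count *s = &t->slot[(start + i) % SLOTS];

		if (s->count == 0 || s->ip == ip) {
			s->ip = ip;
			s->count++;
			return 0;
		}
	}
	return -1;
}

/* Return how many samples were counted for ip (0 if never seen). */
static uint64_t ip_table_get(const struct ip_table *t, uint64_t ip)
{
	size_t start = ip_hash(ip);

	for (size_t i = 0; i < SLOTS; i++) {
		const struct ip_count *s = &t->slot[(start + i) % SLOTS];

		if (s->count == 0)
			return 0;
		if (s->ip == ip)
			return s->count;
	}
	return 0;
}
```

Feeding it the five addresses from the perf script output above would
leave one entry per distinct address, so only the summary needs to
cross to user space rather than one record per sample.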

Here's sampleip_range.c:

/************************ BEGIN **************************/
#include <uapi/linux/bpf.h>
#include <uapi/linux/ptrace.h>

#define SEC(NAME) __attribute__((section(NAME), used))

/*
 * Edit the following to match the instruction address range you want to
 * sample. Eg, look in /proc/kallsyms. The addresses will change for each
 * kernel version and build.
 */
#define RANGE_START 0xffffffff81214b90
#define RANGE_END 0xffffffff81214cd0

struct bpf_map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
};

static int (*probe_read)(void *dst, int size, void *src) =
	(void *)BPF_FUNC_probe_read;
static int (*get_smp_processor_id)(void) =
	(void *)BPF_FUNC_get_smp_processor_id;
static int (*perf_event_output)(void *, struct bpf_map_def *, int, void *,
	unsigned long) = (void *)BPF_FUNC_perf_event_output;

struct bpf_map_def SEC("maps") channel = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = __NR_CPUS__,
};

/* from /sys/kernel/debug/tracing/events/perf/perf_hrtimer/format */
struct perf_hrtimer_args {
	unsigned long long pad;
	struct pt_regs *regs;
	struct perf_event *event;
};

SEC("perf:perf_hrtimer")
int func(struct perf_hrtimer_args *ctx)
{
	struct pt_regs regs = {};

	probe_read(&regs, sizeof(regs), ctx->regs);
	if (regs.ip >= RANGE_START && regs.ip < RANGE_END) {
		perf_event_output(ctx, &channel, get_smp_processor_id(),
				  &regs.ip, sizeof(regs.ip));
	}
	return 0;
}

char _license[] SEC("license") = "GPL";
int _version SEC("version") = LINUX_VERSION_CODE;
/************************* END ***************************/
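
As a sanity check on the perf script output above, the first eight
bytes of each "BPF output" record can be read back as a little-endian
64-bit instruction pointer and compared against the range. A small
user-space sketch (decode_ip and ip_in_range are names I made up for
this; the bounds are copied from sampleip_range.c, and little-endian
byte order is assumed since this was x86-64):

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bounds as RANGE_START / RANGE_END in sampleip_range.c. */
#define RANGE_START 0xffffffff81214b90ULL
#define RANGE_END   0xffffffff81214cd0ULL

/* Assemble a 64-bit value from 8 bytes stored least significant byte first. */
static uint64_t decode_ip(const uint8_t bytes[8])
{
	uint64_t ip = 0;

	for (int i = 7; i >= 0; i--)
		ip = (ip << 8) | bytes[i];
	return ip;
}

/* Same half-open test as the BPF program: START <= ip < END. */
static bool ip_in_range(uint64_t ip)
{
	return ip >= RANGE_START && ip < RANGE_END;
}
```

For the first record, the bytes 3c 4c 21 81 ff ff ff ff decode to
0xffffffff81214c3c, which is inside the range, as expected.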

>
> If I understand correctly, I can have a BPF script run 99 times per
> second using
>
> # perf -e cpu-clock/freq=99/ -e mybpf.c ...
>
> And in mybpf.c, attach a BPF script on the new tracepoint. Right?
>
> Also, since we already have timer:hrtimer_expire_entry, please provide
> some further information about why we need a new tracepoint.

timer:hrtimer_expire_entry fires for every hrtimer expiry, which is
much more than just the perf timer. The perf:perf_hrtimer tracepoint
also passes the sampled registers and the perf event as arguments,
which profiling programs can use.

Thanks for the comments,

Brendan