Re: [RFC 00/13] perf bpf: Add support to run BEGIN/END code

From: Arnaldo Carvalho de Melo
Date: Mon Mar 12 2018 - 09:56:38 EST


On Mon, Mar 12, 2018 at 12:17:05PM +0100, Jiri Olsa wrote:
> adding Alexei and Wang to the loop
>
> On Mon, Mar 12, 2018 at 10:43:00AM +0100, Jiri Olsa wrote:
> > hi,
> > this is *RFC* and the following patchset is very rough
> > and ugly 'proof of concept'-kind-of-toy code. I'm mostly
> > interested in opinions about if this could be useful in
> > your current eBPF usage.
> >
> > Currently we can load eBPF code within the record command
> > and attach it to event. We have 2 ways of communicating
> > the data back to user: bpf-output event that goes to
> > perf.data or 'trace_printk' output in tracefs buffer.
> >
> > AFAICS we're not covering a quite large usage base that runs
> > code before the probe starts and once it is finished, to set up,
> > collect and display the collected data.
> >
> > This patchset adds support to run BEGIN and END
> > code snippets before and after the eBPF probe is loaded.

Right, with all the code that Wang contributed, and reusing that
begin/end code from systemtap, it was easy to do it, not that much code
added, so I don't see a reason for this not to be merged.

On top of this patchset, I think that the restricted C code that is used
to write the eBPF utilities should be simplified. I've toyed with this
from time to time, for instance:

[root@jouet bpf]# cat o_cloexec.c
#include "bpf.h"
#include "stdio.h"

#define O_CLOEXEC 0x80000

int syscall_enter(openat)
{
	char filename[256];
	int flags = syscall_field_int(flags, 32);
	int len = syscall_field_str(filename, 24);

	if (!(flags & O_CLOEXEC))
		return 0;

	perf_stdout(filename, len);
	return 1;
}

[root@jouet bpf]# perf trace -e openat,o_cloexec.c
0.573 ( ): __bpf_stdout__:/etc/ld.so.cache....)
0.576 ( ): syscalls:sys_enter_openat:dfd: 0xffffffffffffff9c, filename: 0x7fc4de411563, flags: 0x00080000, mode: 0x00000000)
0.579 ( 0.013 ms): sh/17728 openat(dfd: CWD, filename: /etc/ld.so.cache, flags: CLOEXEC ) = 3
0.620 ( ): __bpf_stdout__:/lib64/libtinfo.so.6........)
0.622 ( ): syscalls:sys_enter_openat:dfd: 0xffffffffffffff9c, filename: 0x7fc4de619ce0, flags: 0x00080000, mode: 0x00000000)
0.624 ( 0.013 ms): sh/17728 openat(dfd: CWD, filename: /lib64/libtinfo.so.6, flags: CLOEXEC ) = 3
0.705 ( ): __bpf_stdout__:/lib64/libdl.so.2...)
0.708 ( ): syscalls:sys_enter_openat:dfd: 0xffffffffffffff9c, filename: 0x7fc4de5ef4c0, flags: 0x00080000, mode: 0x00000000)
0.710 ( 0.058 ms): sh/17728 openat(dfd: CWD, filename: /lib64/libdl.so.2, flags: CLOEXEC ) = 3
0.852 ( ): __bpf_stdout__:/lib64/libc.so.6....)
0.857 ( ): syscalls:sys_enter_openat:dfd: 0xffffffffffffff9c, filename: 0x7fc4de5ef9a0, flags: 0x00080000, mode: 0x00000000)
0.860 ( 0.021 ms): sh/17728 openat(dfd: CWD, filename: /lib64/libc.so.6, flags: CLOEXEC ) = 3
^C
[root@jouet bpf]#

Hiding details such as:

[root@jouet bpf]# cat stdio.h
struct bpf_map_def SEC("maps") __bpf_stdout__ = {
	.type = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u32),
	.max_entries = __NR_CPUS__,
};

#define perf_stdout(from, len) \
	perf_event_output(ctx, &__bpf_stdout__, BPF_F_CURRENT_CPU, \
			  &from, len & (sizeof(from) - 1));
[root@jouet bpf]#
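
A note on the `len & (sizeof(from) - 1)` part of perf_stdout(): it keeps the size handed to perf_event_output() within the buffer, presumably so that the value is provably bounded. Below is a minimal userspace sketch of just that arithmetic, assuming the buffer size is a power of two. The helper name bounded_len() is made up for illustration and is not part of perf.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of perf: clamp a length the same way
 * perf_stdout() does. Only meaningful when buf_size is a power of two,
 * because then (buf_size - 1) is an all-ones bit mask.
 */
static size_t bounded_len(size_t len, size_t buf_size)
{
	/* Precondition: buf_size is a non-zero power of two. */
	assert(buf_size != 0 && (buf_size & (buf_size - 1)) == 0);

	return len & (buf_size - 1);
}
```

With the 256-byte filename buffer from o_cloexec.c, a length of 17 stays 17, while a length of exactly 256 wraps around to 0, so a string that fills the whole buffer would be emitted with size 0.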

That 'perf trace' will set up the "bpf_output" event, etc.

And the other macros:

#define SEC(NAME) __attribute__((section(NAME), used))

#define pid_map(name, value_type) \
struct bpf_map_def SEC("maps") name = { \
	.type = BPF_MAP_TYPE_HASH, \
	.key_size = sizeof(u64), \
	.value_size = sizeof(value_type), \
	.max_entries = 500, \
}

#define syscall_enter(name) \
	SEC("syscalls:sys_enter_" #name) syscall_enter_ ## name(void *ctx)

#define syscall_exit(name) \
	SEC("syscalls:sys_exit_" #name) syscall_exit_ ## name(void *ctx)

#define syscall_field_str(field, offset) \
	({ char *__ptr = *((char **)(ctx + offset)); \
	   bpf_probe_read_str(field, sizeof(field), __ptr); })

#define syscall_field_int(field, offset) \
	({ int *__ptr = (int *)(ctx + offset); \
	   bpf_probe_read(&field, sizeof(field), __ptr); field; })
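
To see what the field macros boil down to, here is a userspace sketch: the tracepoint context is treated as a raw byte buffer and a value is copied out of it at a fixed offset. mock_probe_read(), mock_field_int() and read_flags() are made-up names, and memcpy() stands in for the BPF helper. Like the macros above, it relies on GNU C statement expressions.

```c
#include <assert.h>
#include <string.h>

/* Assumption: userspace stand-in for the bpf_probe_read() helper. */
static int mock_probe_read(void *dst, unsigned int size, const void *src)
{
	memcpy(dst, src, size);
	return 0;
}

/* Same shape as syscall_field_int(), but with ctx passed explicitly. */
#define mock_field_int(ctx, field, offset)				\
	({ const char *__ptr = (const char *)(ctx) + (offset);		\
	   mock_probe_read(&(field), sizeof(field), __ptr); (field); })

/* Hypothetical caller: fetch 'flags' at the offset used in o_cloexec.c. */
static int read_flags(const void *ctx)
{
	int flags = 0;

	assert(ctx != NULL);
	return mock_field_int(ctx, flags, 32);
}
```

Nothing here checks that offset 32 really is where 'flags' lives; the caller has to know the layout.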

While this hides some of the details, it still hardcodes the offset, so
should be used that way. I was trying to read about clang internals to
do some preprocessing trick that would automagically make the tracepoint
fields accessible as local variables, reading the tracepoint format
files from the running system or from the description stored in the
perf.data header, when running these things on perf.data files.
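
The offset lookup part of that idea can be sketched in plain C, without any clang involvement: scan a tracepoint format description for a field and return its offset. This is only a sketch under assumptions. tp_field_offset() is a hypothetical helper, and sample_format is an abbreviated, illustrative text modeled on the layout of a tracefs format file, with filename and flags placed at the offsets hardcoded in o_cloexec.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Illustrative sample only; real data would come from the format file. */
static const char sample_format[] =
	"name: sys_enter_openat\n"
	"format:\n"
	"\tfield:int __syscall_nr;\toffset:8;\tsize:4;\tsigned:1;\n"
	"\tfield:int dfd;\toffset:16;\tsize:8;\tsigned:0;\n"
	"\tfield:const char * filename;\toffset:24;\tsize:8;\tsigned:0;\n"
	"\tfield:int flags;\toffset:32;\tsize:8;\tsigned:0;\n";

/*
 * Hypothetical helper: return the offset of field_name in a format
 * description, or -1 if the field is not found.
 */
static int tp_field_offset(const char *format_text, const char *field_name)
{
	const char *line = format_text;

	assert(format_text && field_name);

	while (*line) {
		char buf[256];
		size_t len = strcspn(line, "\n");
		char *decl, *semi, *name, *off;
		int offset;

		/* Copy one line so it can be cut up in place. */
		snprintf(buf, sizeof(buf), "%.*s", (int)len, line);
		line += len;
		if (*line == '\n')
			line++;

		decl = strstr(buf, "field:");
		if (!decl)
			continue;
		semi = strchr(decl, ';');
		if (!semi)
			continue;
		off = strstr(semi + 1, "offset:");
		*semi = '\0';

		/* Drop an array suffix such as "[16]", then take the last word. */
		decl[strcspn(decl, "[")] = '\0';
		name = strrchr(decl, ' ');
		if (!name)
			continue;
		name++;
		name += strspn(name, "*");
		if (strcmp(name, field_name) != 0)
			continue;

		if (off && sscanf(off, "offset:%d", &offset) == 1)
			return offset;
	}
	return -1;
}
```

With a lookup like this, the 32 in syscall_field_int(flags, 32) could come from tp_field_offset(text, "flags") instead of being typed into the source.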

- Arnaldo