tracepoint_probe_register and bpf. Was: [PATCH tip 0/3] Improvements of scheduler related Tracepoints

From: Alexei Starovoitov
Date: Wed Dec 20 2017 - 21:04:13 EST

On Mon, Dec 18, 2017 at 10:11:57AM +0100, Peter Zijlstra wrote:
> On Fri, Dec 15, 2017 at 09:09:51AM -0800, Alexei Starovoitov wrote:
> > yeah. Currently bpf progs are called at the end of
> > perf_trace_##call()
> > {
> > .. regular tracepoint copy craft
> > perf_trace_run_bpf_submit( &copied args )
> > }
> >
> > from bpf pov we'd rather get access to raw args passed into
> > perf_trace_##call.
> > Sounds like you're suggesting to let bpf side register its
> > progs directly via tracepoint_probe_register() ?
> > That would solve the whole thing really nicely indeed.
> Not sure I thought that far; but if you want the probe arguments either
> grab them from the perf probe you're running in or indeed register your
> own, don't play silly buggers and copy them in the trace data.

So I've tried both approaches:

1. grab arguments from:
perf_trace_##call(void *__data, proto)

it works ok, but it adds 50-70 bytes of .text to every perf_trace_*
functions which multiplied by the number of tracepoints (~500) which in
total adds ~30k of .text to vmlinux
Not too bad, but it also adds extra branch to critical
execution path, so not ideal

2. register your own
This approach I think is much better.
The trick I'm doing is the following:
#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
/* no 'static' here. The bpf probe functions are global */ \
notrace void \
__bpf_trace_##call(void *__data, proto) \
{ \
struct trace_event_call *event_call = __data; \
__FN_COUNT(bpf_trace_run, args)(event_call, CAST_TO_U64(args)); \

Where CAST_TO_U64 is a macro to cast all args to u64 one by one.
bpf_trace_run*() functions are defined as:
void bpf_trace_run1(struct trace_event_call *call, u64 arg1);
void bpf_trace_run2(struct trace_event_call *call, u64 arg1, u64 arg2);
void bpf_trace_run3(struct trace_event_call *call, u64 arg1, u64 arg2, u64 arg3);

so in assembler it looks like:
(gdb) disassemble __bpf_trace_xdp_exception
Dump of assembler code for function __bpf_trace_xdp_exception:
0xffffffff81132080 <+0>: mov %ecx,%ecx
0xffffffff81132082 <+2>: jmpq 0xffffffff811231f0 <bpf_trace_run3>


TP_PROTO(const struct net_device *dev,
const struct bpf_prog *xdp, u32 act),

The above assembler snippet is casting 32-bit 'act' field into 'u64'
to pass into bpf_trace_run3() and all other registers are passed as-is.
So all of ~500 of __bpf_trace_*() functions are only 5-10 _byte_ long
and in total this approach adds 7k bytes to .text and 8k bytes
to .rodata since I want them to appear in kallsyms, so I don't
have to add another 8-byte fields to 'struct trace_event_class'
and 'struct trace_event_call'

Such approach, I think, is the lowest overhead we can possibly have
while calling trace_xdp_exception() from kernel C code and
transitioning into bpf land.
Since many folks are starting to use tracepoint+bpf at speeds of
1M+ events per second this would be very valuable optimization.

Diff stat so far:
drivers/gpu/drm/i915/i915_trace.h | 13 ++++++++++---
fs/btrfs/super.c | 4 ++++
include/linux/trace_events.h | 27 +++++++++++++++++++++++++++
include/trace/bpf_probe.h | 82 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
include/trace/define_trace.h | 1 +
include/uapi/linux/bpf.h | 4 ++++
kernel/trace/bpf_trace.c | 63 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 191 insertions(+), 3 deletions(-)

Since I'm not touching anything on ftrace or perf side,
I'm thinking to extend BPF_PROG_ATTACH command to specify
which tracepoint to attach to.
The user space will look like:

prog_fd = bpf_prog_load(...);
bpf_prog_attach(prog_fd, "xdp_exception");

On the kernel side I will walk kallsyms to
find __tracepoint_xdp_exception record, check that
tp->name == "xdp_exception",
then find __bpf_trace_xdp_exception() address (also in kallsyms),
and finally use tracepoint_probe_register() to connect the two.