Re: [RFC][PATCH 00/10] Add trace event support to eBPF

From: Tom Zanussi
Date: Tue Feb 16 2016 - 17:35:36 EST


On Sun, 2016-02-14 at 01:02 +0100, Alexei Starovoitov wrote:
> On Fri, Feb 12, 2016 at 10:11:18AM -0600, Tom Zanussi wrote:
> > Hi,
> >
> > As promised in previous threads, this patchset shares some common
> > functionality with the hist triggers code and enables trace events to
> > be accessed from eBPF programs.
>
> great that you've started working on BPF!
>
> > It needs to be applied on top of current tracing/for-next with the
> > hist triggers v13 patches applied:
> >
> > https://lkml.org/lkml/2015/12/10/640
> >
> > Any input welcome, especially things that could be done better wrt
> > eBPF, since I'm not as familiar with that code...
>
> this dependency looks odd. As I was saying a year ago I don't think

It's not actually a dependency, just a consequence of trying to share
code for this RFC that happened to already be in hist triggers. I could
just as well have removed the hunk that deletes that code from hist
triggers, but one of the main points was to show how they could be
shared.

> this hist triggers belong in the kernel. BPF already can do
> way more complex aggregation and histograms.

Way more? I still can't accomplish with eBPF some of the most basic and
helpful use cases that I can with hist triggers, such as using
stacktraces as hash keys. And the locking in the eBPF hashmap
implementation prevents anything like the function_hist [1] tracer from
being implemented on top of it:

.
.
.
ip: [ffffffffa0166350] gen7_render_get_cmd_length_mask hitcount: 18551173520
ip: [ffffffff817589b0] _cond_resched hitcount: 19242157574
ip: [ffffffff8175bd60] _raw_spin_lock hitcount: 19275222963
ip: [ffffffff81232c60] __fdget hitcount: 22865629762
ip: [ffffffff81232c00] __fget_light hitcount: 24594932641
ip: [ffffffff8175bed0] _raw_spin_lock_irqsave hitcount: 26510496751
ip: [ffffffff8175bb30] _raw_spin_unlock_irqrestore hitcount: 26525916042
ip: [ffffffffa018d6c0] gen6_ring_get_seqno hitcount: 95888137753

Totals:
Hits: 1020596753338
Entries: 7102
Dropped: 0

[1]
http://git.yoctoproject.org/cgit/cgit.cgi/linux-yocto-contrib/commit/?h=tzanussi/func-hist-v0&id=fa123181a068574bd9e300416cc947aa7424d808

> Take a look at all the tools written on top of it:
> https://github.com/iovisor/bcc/tree/master/tools

That's great, but it's all out-of-tree. Supporting out-of-tree users
has never been justification for merging in-kernel code (or for blocking
it from being merged).

> I don't buy into reasoning that hist triggers are needed to do
> performance analysis in the busybox. If not in busybox,
> just use perf+bpf or bcc.
>

I have systems with tiny amounts of memory and storage that have zero
chance of ever having a compiler or Python running on them. It's
indispensable on those systems to have access to all and nothing more
than the ftrace-based tools and trace event files directly. I even have
systems that don't have a shell at all, but can nonetheless access all
the ftrace and trace event functionality through a different channel.
hist triggers is just another one of those tools, a logical progression
towards allowing users in such a situation (among others) to make more
efficient and concise use of the trace event data. I don't see what
your problem with that is.

> that aside we do need to be able to attach bpf to tracepoints.
> The key goal of bpf+tracepoints is to be faster than bpf+kprobe.
> kprobes were good enough for almost all use cases so far except
> when we started processing millions of events per second via
> kprobe+bpf. At that point even optimized kprobe adds too much
> overhead.
>

It sounds from this like you're mainly interested in hooking into the
trace events but not the trace event data, which is fine, but other
tools might find some value in the trace event data, even if it may not
be the most optimal path.

I haven't measured the overhead of the cost of accessing data from the
trace buffers, but I do know that the hist triggers have no problem
logging millions of events per second.

In the past I have measured the basic hist triggers mechanism and found
it to be somewhat faster than the ftrace function tracer itself:

over a kernel compile:

no tracing:

real 110m47.817s
user 99m34.144s
sys 7m19.928s

function tracer enabled:

real 138m6.854s
user 100m57.692s
sys 36m27.724s

function_graph tracer enabled:

real 194m38.052s
user 102m32.192s
sys 97m47.056s

function_hist tracer enabled:

real 128m44.682s
user 100m29.080s
sys 26m52.880s

Given the fact that I haven't heard people complaining about the
overhead of even the function_graph tracer, I don't think the overhead
would be a show-stopper in most cases.

> Patch 9:
> + char common_pid_field_name1[] = "common_pid";
> + key.pid = bpf_trace_event_field_read(ctx, common_pid_field_name1);
>
> this is already achieved by bpf_get_current_pid_tgid()
> and it is several times faster.
>

Right, but the point of the examples was just to show that the mechanism
worked. I guess I should have named them tracexX_... instead, to make
that clear.

One point I would make about this though is that while it might be
slower to access this particular field that way, the user who's just
trying to get something done doesn't need to know about
bpf_get_current_pid_tgid() and can just look at the available fields in
the trace event format file and use them directly - trading off
efficiency for ease-of-use.

I also wonder about the trace event fields that don't boil down to
simple assignments - do each of them require 'helper' functions be
written in order to access the same data, if you're not interested in
using the trace event data?

> + char count_field_name1[] = "count";
> + count = bpf_trace_event_field_read(ctx, count_field_name1);
>
> Already possible by simple PT_REGS_PARM3(ctx) which is faster as well.

It may be 'simple' in that it's only 20 chars or so, but that seems to
imply a certain kind of deep knowledge on the part of the user in order
to be able to write even a simple eBPF program. Maybe some users would
like to just say, 'just get me the count value, I don't really want to
know all about this pt_regs thing'. BTW, where is PT_REGS_PARM3() and
how to use it documented for a new user, other than via the examples?

> And if you're using bcc you can write
> sys_read(struct pt_regs *ctx, int fd, char *buf, size_t count)
> { .. use 'count' ...}
> and bcc will automatically convert 'count' to ctx->dx.
> The latest perf can do this as well with slightly different syntax.
>
> Patch 10:
> + char len_field[] = "len";
> + len = bpf_trace_event_field_read(ctx, len_field);
> +
> + char name_field[] = "name";
> + bpf_trace_event_field_read_string(ctx, name_field, name, sizeof(name));
> +
> + char common_pid_field[] = "common_pid";
> + common_pid = bpf_trace_event_field_read(ctx, common_pid_field);
>
> all of the above can be done already and it's even demoed
> in samples/bpf/tracex1_kern.c

Right, and again, just an example to show that it works, and in this
case that you're getting the same data from the kprobe and the static
trace event.

>
> The main problem with this patch set is in patch 5:
> + field_name = (char *) (long) r2;
> + field = trace_find_event_field(ctx->call, field_name);
>
> This is very slow, since it's doing for_each_field(){strcmp()}
> That's the reason that simple ctx->register or bpf_probe_read()
> are much faster.
>

Yes, this and the next thing were the main reasons for making this an
RFC patch - I actually was hoping for someone to point out the 'right'
way to do this in eBPF.

Obviously there's no reason to do this lookup more than once - it should
be done once and then saved, but I ran into a few questions about that:

- should I have a helper that returns a pointer to the field, then save
it in a 'global hashmap' with one entry, like I've seen in other eBPF
code, and pass it in whenever accessing the field contents?
- I tried that but the top half was masked off, why?
- given that I only want something like the field lookup to happen
once, in eBPF what's the accepted way to do a 'begin probe' i.e. some
code like populating a global field array before the 'main' program
starts?

> There are few other issues with this set:
> Patch 2:
> + if (off < 0 || off >= TRACE_EVENT_CTX_HDR_SIZE + BUF_MAX_DATA_SIZE)
> + return false;
> that will allow bpf program to read a page worth of data from some
> stack pointer, since in patch 6:
> + struct trace_event_context ctx = { call, entry };
> + if (!trace_call_bpf(prog, &ctx))
> the kernel probably won't crash, but it shouldn't be allowed.
>

Right, the intent was to allow the eBPF program a somehow access a
pointer to the trace record, and grab the field directly. Obviously I
don't have a clue about how to make that properly happen in the eBPF
interpreter, and I don't see anything explaining all the voodoo
surrounding that even in the comments. I guess I'd have to spend a few
hours reading the BPF code and the verifier even, to understand that.
It would be nice if an interface that seems like it would be used by
other people wanting to extend eBPF could provide some hints in the
comments if not be thoroughly explained in the documentation.

> Patch 6 is odd.
> We already have prog_type_kprobe to work with kprobes.
> Having prog_type_trace_event there is unnecessary.
>

Not if you intend on making use of the trace event data. Which I think
I've realized that you don't, and therefore this whole patchset is
indeed pointless.

> The syscall part of Patch 7 is probably unnecessary too.
> When we benchmarked bpf+syscall it turned out that adding
> kprobe to sys_xx is actually faster than using syscall_enter() logic.
> May be things have changed. Need to re-measure.
>
> The tracepoint part of Patch 7 makes sense, but it has too much
> overhead to be competitive with kprobe.
> perf_trace_buf_prepare() does local_save_flags() which is quite
> expensive instruction when tracepoint is called million times a second.
> perf_fetch_caller_regs() is heavy too, since it zeros pt_regs which
> will be unused the bpf program.
>
> I'm also working on bpf+tracepoint.
> My plan was to add a variant of perf_trace_buf_prepare() that
> doesn't populate 'struct trace_entry' and just returns a pointer.
> Then perf_trace_##call() does tstruct and {assign;} into that
> 'entry' pointer and then call bpf prog with
> 'struct ctx {regs, entry_ptr, __data_size}'

Ah, I think this is what I was missing with the Patch 2 thing. Well, I
guess I'll find out what was the right way to do it when I see your
patch.

> Then bpf prog will call a helper to copy the data into its stack
> and will access it there as normal.
> It's going to be significantly faster than for_each_field+strcmp
> approach and little bit faster than kprobe, but I'm not happy
> with that speed yet.
> I'm still thinking on how to avoid 2nd copy and extra helper.
> The program should be able to access the entry_ptr directly,
> but some verifier magic needs to be done. so it's still wip.
>
> btw, ast@plumgrid is no longer functional and
> please cc netdev every time kernel/bpf/ is touched.
>

Why netdev? This has nothing to do with networking.