Re: Re: Re: [PATCH tip 0/9] tracing: attach eBPF programs to tracepoints/syscalls/kprobe

From: Alexei Starovoitov
Date: Tue Jan 20 2015 - 15:33:31 EST


On Tue, Jan 20, 2015 at 3:57 AM, Masami Hiramatsu
<masami.hiramatsu.pt@xxxxxxxxxxx> wrote:
>
> Ok, BTW, would you think is it possible to use a reusable small scratchpad
> memory for passing arguments? (just a thought)

sure. doable, but what's the use case?

>> It's not usable for high frequency events which
>> need this in-kernel aggregation.
>> If events are rare, then just dumping everything
>> into trace buffer is just fine. No in-kernel program is needed.
>
> Hmm, let me ensure your point, the performance number is the reason why
> we need to do it in the kernel, right? Not mainly for the flexibility but speed.

if user space can do X at the same speed as the kernel,
then user space is the better choice and more flexible.
With bpf programs there are two things user space cannot do:
- fast aggregation without adding a penalty to the things being traced
- access to in-kernel data structures
And often both are used together.
Say, we want to monitor the amount of network traffic per user.
So we'd use the trace_net_dev_xmit() tracepoint and do
map[current_uid()] += skb_len
as part of the program.
Overhead will be tiny and users won't notice any slowdown.
Trying to do the same in user space by enabling
this tracepoint has two problems:
high overhead, and events that are hard to aggregate
per user, since the trace has 'pid', but short-lived
processes will have dead pids in the trace output.
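
To make the map[current_uid()] += skb_len idea concrete, here is a
rough sketch in plain user-space C. It is not a real bpf program: the
fixed-size table only stands in for a bpf map, and the names uid_map,
uid_map_add and uid_map_get are made up for illustration. In a real
program the uid and skb length would come from the tracepoint context.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define UID_MAP_SLOTS 64 /* illustrative capacity, chosen arbitrarily */

struct uid_slot {
    int      used;   /* non-zero once the slot holds a uid */
    uint32_t uid;
    uint64_t bytes;  /* running total of bytes sent by this uid */
};

/* Hypothetical stand-in for a bpf hash map keyed by uid. */
struct uid_map {
    struct uid_slot slots[UID_MAP_SLOTS];
};

static void uid_map_init(struct uid_map *m)
{
    assert(m != NULL);
    memset(m, 0, sizeof(*m));
}

/* Linear probing lookup. With create != 0 a free slot is claimed for
 * a new uid. Returns NULL if the uid is absent (or the table is full). */
static struct uid_slot *uid_map_find(struct uid_map *m, uint32_t uid,
                                     int create)
{
    uint32_t start = uid % UID_MAP_SLOTS;

    for (uint32_t i = 0; i < UID_MAP_SLOTS; i++) {
        struct uid_slot *s = &m->slots[(start + i) % UID_MAP_SLOTS];

        if (s->used && s->uid == uid)
            return s;
        if (!s->used) {
            if (!create)
                return NULL;
            s->used = 1;
            s->uid = uid;
            s->bytes = 0;
            return s;
        }
    }
    return NULL;
}

/* The per-packet step: map[uid] += skb_len.
 * Returns 0 on success, -1 if the table is full. */
static int uid_map_add(struct uid_map *m, uint32_t uid, uint32_t skb_len)
{
    struct uid_slot *s = uid_map_find(m, uid, 1);

    if (!s)
        return -1;
    s->bytes += skb_len;
    return 0;
}

/* What user space would read back: one total per uid. */
static uint64_t uid_map_get(struct uid_map *m, uint32_t uid)
{
    struct uid_slot *s = uid_map_find(m, uid, 0);

    return s ? s->bytes : 0;
}
```

The point of the sketch: user space reads a handful of per-uid totals
every now and then instead of processing one event per packet.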

> - perf probe and kprobe-event gives us a complete understandable
> interface for what will be recorded at where.
> (we can see the event definitions via kprobe_events interface,
> without any tools)
> - kprobe-event gives a completely same interface as other tracepoint
> events.
> - it also doesn't require any build-binary parts :) nor special tools.
> We can play with ftrace on just a small busybox.

yeah, when debugging in busybox is the goal
and 'cat' and 'echo' are your only tools, then
the debugfs interface is the only choice :)

> However, this does NOT interfere your patch upstreaming. I just said current
> ftrace method is also meaningful for some reasons :)

of course :)
To emphasize the point I was trying to make with tracex1:
The program is a filter/aggregator. The bpf maps
are not suitable for streaming the events. That's the job
of the ring buffer/trace_pipe. The program may choose
to aggregate some events and discard them (by
returning 0 from the program), and the rest of
the events will be streamed to user space via the
ring buffer in the format statically defined by the tracepoint
or by the kprobe arguments.
The tracex1 example loads the program and then
reads /sys/kernel/debug/tracing/trace_pipe...
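
The "return 0 to discard, otherwise stream" contract can be sketched
in plain user-space C. This is only a simulation, not the tracex1
code: the struct layouts, the LARGE_PKT threshold, and the names
filter_prog and run_events are all invented for illustration, and the
out[] array merely plays the role of the ring buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LARGE_PKT 1500 /* invented threshold: smaller packets are only counted */

/* Hypothetical event, standing in for the tracepoint arguments. */
struct xmit_event {
    uint32_t uid;
    uint32_t len;
};

/* Hypothetical aggregate, standing in for a bpf map. */
struct prog_state {
    uint64_t small_bytes;
};

/* The "program": aggregate small packets and discard them by
 * returning 0; return non-zero to let the event be streamed. */
static int filter_prog(struct prog_state *st, const struct xmit_event *ev)
{
    if (ev->len < LARGE_PKT) {
        st->small_bytes += ev->len;
        return 0;
    }
    return 1;
}

/* Stand-in for the tracing core: run the program on each event and
 * copy only the events it passes into out[]. Passed events beyond
 * out_cap are dropped. Returns the number of events stored in out[]. */
static size_t run_events(struct prog_state *st,
                         const struct xmit_event *evs, size_t n,
                         struct xmit_event *out, size_t out_cap)
{
    size_t streamed = 0;

    assert(out != NULL || out_cap == 0);
    for (size_t i = 0; i < n; i++) {
        if (filter_prog(st, &evs[i]) == 0)
            continue; /* discarded, lives on only in the aggregate */
        if (streamed < out_cap)
            out[streamed++] = evs[i];
    }
    return streamed;
}
```

Feeding it packets of 100, 9000, 200 and 1500 bytes leaves 300 bytes
in the aggregate and passes two events. Those two are what a reader
of trace_pipe would see, in the tracepoint's fixed format.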

That part I was trying to improve with bpf_trace_printk:
to give programs the ability to stream data in a format
different from the one statically defined by tracepoints.
But trace_printk has its disadvantages, so probably
something cleaner is needed.
Like in my earlier example of trace_net_dev_xmit,
if the program could add the uid to the arguments
already printed, it would help user space.
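
As a rough picture of what "add the uid to the arguments already
printed" could look like on the output side, here is a user-space C
sketch. The function name format_xmit_line and the "dev=... len=..."
layout are assumptions for illustration, not the actual tracepoint
format or any existing bpf helper.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Writes one trace-style line with an extra uid field into buf.
 * Returns the snprintf() result: the length the full line needs,
 * not counting the terminating NUL. */
static int format_xmit_line(char *buf, size_t size, const char *dev,
                            uint32_t len, uint32_t uid)
{
    assert(buf != NULL && dev != NULL);
    return snprintf(buf, size, "dev=%s len=%u uid=%u",
                    dev, (unsigned)len, (unsigned)uid);
}
```

With the uid in each line, user space no longer has to map a possibly
dead pid back to a user after the fact.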

> By the way, I concern about that bpf compiler can become another systemtap,
> especially if you build it on llvm.
> Would you plan to develop it on kernel
> tree? or apart from the kernel-side development?

I'm not sure I completely understand the concern.
perf is using a bunch of out-of-tree libraries.
mcjit of llvm or libgccjit are just more libraries.
Or maybe eventually eBPF can be generated
by something like libpcap.

Ideally I would like to see 'perf run script.txt'
where script.txt is a program in a language suited
for tracing. The tracing language will not necessarily
fit networking use cases. Currently I'm
using C for both and it's the most convenient,
but some folks complained that the 'restricted'
nature of this C is hard to grasp, so I can only
encourage Jovi to do a ktap language to bpf
translator. If it generates bpf directly that's great,
if it uses a gcc or llvm backend that's fine too.

> I think it is hard to sync the development if you do it out-of-tree.

I think some pieces would have to be out of tree.
I've kept a standalone llvm backend across 3.2, 3.3 and 3.4,
but it gets polluted with ifdefs and is not really a long term
solution, so now I'm working on upstreaming it,
and the feedback/code reviews I got definitely improved
the quality of the bpf backend.
In the case of backends the only bit to sync is the instruction
set itself, which is stable. New instructions may be
added, but that's not a concern.
The llvm backend doesn't care what language is
used in the front-end, how programs are attached
to tracepoints, or what set of bpf helper
functions is available.
All such bits and the main interface for the
dynamic tracer, imo, should be in the perf binary.
What it does underneath and how
many times it calls into the llvm/gcc lib won't be visible.
In the case of systemtap, compile time is, for whatever
reason, slow to the point of being annoying,
but here it should be instant.