Re: [PATCH v3 linux-trace 1/8] tracing: attach eBPF programs to tracepoints and syscalls

From: Steven Rostedt
Date: Tue Feb 10 2015 - 19:49:56 EST

On Tue, 10 Feb 2015 16:22:50 -0800
Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:

> > Yep, and if this becomes a standard, then any change that makes
> > trace_pipe different will be reverted.
> I think reading of trace_pipe is widespread.

I've heard of a few, but nothing that has broken when something changed.

Is it scripts or actual C code?

Again, it matters if people complain about the change.

> >> But some maintainers think of them as ABI, whereas others
> >> are using them freely. imo it's time to remove ambiguity.
> >
> > I would love to, and have brought this up at Kernel Summit more than
> > once with no solution out of it.
> let's try it again at plumbers in august?

Well, we need a statement from Linus. And it would be nice if we could
also get Ingo involved in the discussion, but he seldom comes to
anything but Kernel Summit.

> For now I'm going to drop bpf+tracepoints, since it's so controversial
> and go with bpf+syscall and bpf+kprobe only.

Probably the safest bet.

> Hopefully by august it will be clear what bpf+kprobes can do
> and why I'm excited about bpf+tracepoints in the future.

BTW, I wonder if I could make a simple compiler in the kernel that
would translate the current ftrace filters into a BPF program, where it
could use the program and not use the current filter logic.

> >> These tracepoint will receive one or two pointers to important
> >> structs only. They will not have TP_printk, assign and fields.
> >> The placement and arguments to these new tracepoints
> >> will be an ABI.
> >> All existing tracepoints are not.
> >
> > TP_printk() is not really an issue.
> I think it is. The way things are printed is the most
> visible part of tracepoints and I suspect maintainers don't
> want to add new ones, because internal fields are printed
> and users do parse trace_pipe.
> Recent discussion about tcp instrumentation was
> about adding new tracepoints and a module to print them.
> As soon as something like this is in, the next question
> 'what we're going to do when more arguments need
> to be printed'...

I should rephrase that. It's not that it's not an issue, it's just that
it hasn't been an issue. the trace_pipe code is slow. The
raw_trace_pipe is much faster. Any tool would benefit from using it.

I really need to get a library out to help users do such a thing.

> imo the solution is DEFINE_EVENT_BPF that doesn't
> print anything and a bpf program to process it.

You mean to be completely invisible to ftrace? And the debugfs/tracefs

> >
> >> it is portable and will run on any kernel.
> >> In uapi header we can define:
> >> struct task_struct_user {
> >> int pid;
> >> int prio;
> >
> > Here's a perfect example of something that looks stable to show to
> > user space, but is really a pimple that is hiding cancer.
> >
> > Lets start with pid. We have name spaces. What pid will be put there?
> > We have to show the pid of the name space it is under.
> >
> > Then we have prio. What is prio in the DEADLINE scheduler. It is rather
> > meaningless. Also, it is meaningless in SCHED_OTHER.
> >
> > Also note that even for SCHED_FIFO, the prio is used differently in the
> > kernel than it is in userspace. For the kernel, lower is higher.
> well, ->prio and ->pid are already printed by sched tracepoints
> and their meaning depends on scheduler. So users taking that
> into account.

I know, and Peter hates this.

> I'm not suggesting to preserve the meaning of 'pid' semantically
> in all cases. That's not what users would want anyway.
> I want to allow programs to access important fields and print
> them in more generic way than current TP_printk does.
> Then exposed ABI of such tracepoint_bpf is smaller than
> with current tracepoints.

Again, this would mean they become invisible to ftrace, and even

I'm not fully understanding what is to be exported by this new ABI. If
the fields available, will always be available, then why can't the
appear in a TP_printk()?

> > eBPF is very flexible, which means it is bound to have someone use it
> > in a way you never dreamed of, and that will be what bites you in the
> > end (pun intended).
> understood :)
> let's start slow then with bpf+syscall and bpf+kprobe only.

I'm fine with that.

> >> also not all bpf programs are equal.
> >> bpf+existing tracepoint is not ABI
> >
> > Why not?
> well, because we want to see more tracepoints in the kernel.
> We're already struggling to add more.

Still, the question is, even with a new "tracepoint" that limits what
it shows, there still needs to be something that is guaranteed to
always be there. I still don't see how this is different than the
current tracepoints.

> >> bpf+new tracepoint is ABI if programs are not using bpf_fetch
> >
> > How is this different?
> the new ones will be explicit by definition.

Who monitors this?

> > To give you an example, we thought about scrambling the trace event
> > field locations from boot to boot to keep tools from hard coding the
> > event layout. This may sound crazy, but developers out there are crazy.
> > And if you want to keep them from abusing interfaces, you just need to
> > be a bit more crazy than they are.
> that is indeed crazy. the point is understood.
> right now I cannot think of a solid way to prevent abuse
> of bpf+tracepoint, so just going to drop it for now.

Welcome to our world ;-)

> Cool things can be done with bpf+kprobe/syscall already.


-- Steve

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at