Re: Fix powerTOP regression with 2.6.39-rc5

From: Steven Rostedt
Date: Sat May 07 2011 - 06:45:19 EST


On Sat, 2011-05-07 at 08:58 +0200, Ingo Molnar wrote:
> * Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:

> You have just summed up the main philosophical difference between perf and
> ftrace: with perf we have a "sane tooling first" approach, while ftrace is
> still the old "kernel developers first" approach.

I actually believe that the opposite is true.

>
> In the past 10 years i pushed tons of instrumentation code upstream and for a
> long time the kernel-integrated ftrace approach looked like the technical best
> solution to me, but after 2 years of sane instrumentation tooling via a proper
> user-space ABI and tools/perf/ i'm not looking back.
>

I would like to point out that the problem with the ABI breakage came
through perf and not ftrace. What I gathered from Linus's response is
that, although I made a robust interface (the format of the events) for
tools to use, it was also possible for tools to use another interface
to interact directly with the raw binary data. Since it was easier to
just map the raw binary data than to use the exported format, they did
that instead, even though a library already existed to parse the format
and keep the events robust. And the "reality" is that the raw binary
format became the ABI.
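
To make the "exported format" route concrete, here is a minimal Python
sketch of a tool that looks fields up by name from a format description
instead of mapping the raw record directly. The sample text is only
modelled on the per-event format descriptions; the event name, ID and
field values are made up, parse_format() and read_field() are
hypothetical helpers (not libparsevent's API), and little-endian records
are assumed.

```python
import re

# Illustrative format description; the event, ID and layout are made up.
SAMPLE_FORMAT = """\
name: example_event
ID: 42
format:
\tfield:unsigned short common_type;\toffset:0;\tsize:2;\tsigned:0;
\tfield:unsigned char common_flags;\toffset:2;\tsize:1;\tsigned:0;
\tfield:unsigned char common_preempt_count;\toffset:3;\tsize:1;\tsigned:0;
\tfield:int common_pid;\toffset:4;\tsize:4;\tsigned:1;

\tfield:int state;\toffset:8;\tsize:4;\tsigned:1;
"""

# One "field:<decl>; offset:<n>; size:<n>; [signed:<0|1>;]" entry.
FIELD_RE = re.compile(
    r"field:(?P<decl>[^;]+);\s*offset:(?P<offset>\d+);\s*size:(?P<size>\d+);"
    r"(?:\s*signed:(?P<signed>[01]);)?"
)


def parse_format(text):
    """Return {field_name: (offset, size, signed)} from a format description."""
    fields = {}
    for match in FIELD_RE.finditer(text):
        # The field name is the last token of the C declaration;
        # drop any array suffix such as "[16]".
        name = match.group("decl").split()[-1].split("[")[0]
        fields[name] = (
            int(match.group("offset")),
            int(match.group("size")),
            match.group("signed") == "1",
        )
    return fields


def read_field(record, fields, name, byteorder="little"):
    """Decode one integer field from a raw record, located by name."""
    offset, size, signed = fields[name]
    return int.from_bytes(record[offset:offset + size], byteorder, signed=signed)


fields = parse_format(SAMPLE_FORMAT)
# A fake 12-byte record: type=42, flags=0, preempt_count=1, pid=1234, state=-1.
record = (
    bytes([42, 0, 0, 1])
    + (1234).to_bytes(4, "little", signed=True)
    + (-1).to_bytes(4, "little", signed=True)
)
print(read_field(record, fields, "common_pid"))
```

The only things this reader depends on are the field names; the offsets
and sizes come from the description at run time.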

With ftrace, there was no easy way to get at that raw format. It was
perf that exposed the raw binary formats, and made them easy to access,
for tools like powerTOP. Thus, instead of spending the time to use the
proper robust format, tools just mapped the raw binary format. Peter
Zijlstra wisely saw this problem and asked me to randomize the fields
to prevent the raw mappings. But that would have either broken the ease
of use of TRACE_EVENT() for kernel developers, or drastically slowed
down the trace recording. We both reluctantly kept the fields the same.
Once again, I feel burned because I didn't listen to Peter ;)

Now, at the end of Linus's email, he gave a slight "but". It seems that
not many tools are currently accessing the raw data. If all those tools
agree to convert to the proper format before too many others start,
then he may allow this change to take place. I already discussed this
with Arjan, and he agreed to use libparsevent.so if I can get it
packaged with Fedora and Ubuntu. This is a robust solution, so that we
do not get stuck forever with things like recording the pid, preempt
count, interrupt flags and other things in the kernel for every single
event.


> I am strongly convinced that we need to bite the bullet and unify the two
> approaches to enable even better tooling: expose the remaining bits of tracing
> functionality not available via perf yet via the perf ABI and move it under a
> single umbrella, slowly phase out the ABI-unstable /debug/tracing/ debugfs crap
> for new features and use the strict perf ABI approach. Steve?

Actually, I now want to separate ftrace from perf even more. This
problem is not an ftrace problem but a perf one. The raw ABI that tools
use is from perf. Thus, that "padding" can be added to perf directly
instead of using the ftrace code, and powerTOP will still work, and
ftrace can change on the fly as all its tools use the libparsevent
library.

Here are the choices, then:

1) we get libparsevent.so out into the world and all tools can use it,
and the raw formats of the trace events will no longer be an issue as
long as the names of events and fields stay the same.

2) we separate perf from ftrace and keep the "stable" ABI for perf, and
let ftrace advance into a more efficient tracer.
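
To illustrate what is at stake in both choices, here is a small Python
sketch with two hypothetical layouts of the same event. A reader with a
baked-in offset silently returns the wrong field once a field is
dropped, while a reader that goes by field name keeps working. All the
names, offsets and values below are invented for the example, and
little-endian records are assumed.

```python
# Two hypothetical layouts of one event: {field_name: (offset, size)}.
# NEW_LAYOUT drops "common_extra", so every later field moves up 4 bytes.
OLD_LAYOUT = {
    "common_pid": (4, 4),
    "common_extra": (8, 4),
    "state": (12, 4),
    "cpu": (16, 4),
}
NEW_LAYOUT = {
    "common_pid": (4, 4),
    "state": (8, 4),
    "cpu": (12, 4),
}


def build_record(layout, values, total_size):
    """Pack the given values into a raw record according to a layout."""
    buf = bytearray(total_size)
    for name, (offset, size) in layout.items():
        raw = values.get(name, 0).to_bytes(size, "little", signed=True)
        buf[offset:offset + size] = raw
    return bytes(buf)


def read_by_name(record, layout, name):
    """Robust reader: the offset comes from the layout description."""
    offset, size = layout[name]
    return int.from_bytes(record[offset:offset + size], "little", signed=True)


# What a tool that mapped the raw OLD record would have baked in.
HARDCODED_STATE_OFFSET = 12


def read_state_hardcoded(record):
    """Fragile reader: assumes "state" always lives at byte 12."""
    chunk = record[HARDCODED_STATE_OFFSET:HARDCODED_STATE_OFFSET + 4]
    return int.from_bytes(chunk, "little", signed=True)


values = {"common_pid": 1234, "state": 7, "cpu": 3}
old_record = build_record(OLD_LAYOUT, values, 20)
new_record = build_record(NEW_LAYOUT, values, 16)

# Same event contents, read both ways from both layouts.
print(read_state_hardcoded(old_record), read_by_name(old_record, OLD_LAYOUT, "state"))
print(read_state_hardcoded(new_record), read_by_name(new_record, NEW_LAYOUT, "state"))
```

On the new layout the hardcoded reader picks up the value of "cpu"
instead of "state" without any error, which is the kind of breakage
that padding the raw format avoids and that name-based parsing makes
irrelevant.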

-- Steve

