Re: Fix powerTOP regression with 2.6.39-rc5

From: Ingo Molnar
Date: Wed May 11 2011 - 17:51:35 EST



* Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:

> [root@bxf perf]# ~/bin/perf record -a -e 'syscalls:*'
>
> Error: sys_perf_event_open() syscall returned with 24 (Too many open files). /bin/dmesg may provide additional information.
>
> Fatal: No CONFIG_PERF_EVENTS=y kernel support configured?

Yeah, this is a known bug. Have you seen Peter's patch that addresses it?
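
For reference, a rough sketch of why a system-wide 'syscalls:*' session can run
into EMFILE. It assumes that a system-wide perf session opens one file
descriptor per event per CPU, and it compares that product against the soft
RLIMIT_NOFILE. The helper names are made up for illustration, and this is not
what Peter's patch does; it only shows the arithmetic behind the error.

```c
#include <stdint.h>
#include <sys/resource.h>

/*
 * Assumption: a system-wide session needs one fd per (event, CPU) pair.
 * Hypothetical helper, not part of perf.
 */
static inline uint64_t perf_fds_needed(uint64_t nr_events, uint64_t nr_cpus)
{
	return nr_events * nr_cpus;
}

/*
 * Returns 1 if the estimated fd count fits under the soft RLIMIT_NOFILE,
 * 0 if it does not, and -1 if the limit could not be read. Descriptors the
 * process already holds (stdio, output file) are ignored here, so the real
 * headroom is slightly smaller.
 */
static inline int perf_fds_fit(uint64_t nr_events, uint64_t nr_cpus)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
		return -1;
	if (rl.rlim_cur == RLIM_INFINITY)
		return 1;
	return perf_fds_needed(nr_events, nr_cpus) <= (uint64_t)rl.rlim_cur;
}
```

With, say, 600 syscall tracepoints (an illustrative figure, not a measured one)
on a 4-CPU box, perf_fds_needed(600, 4) comes to 2400 descriptors, which is
already above a soft limit of 1024 where that is the configured value.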

People who run into this bug will take the path of least resistance: not fix it
and use ftrace. This is sadly how 'splitting a small pond into two' tends to
work out in practice: both halves stink a little bit more than they would if
they were kept together ;-)

This is why lttng as a separate project within the kernel was and is a bad idea
IMO.

I think this further strengthens the idea that we should join stuff and not
keep it split!

> > I really meant it when i told you that perf events were the natural next
> > step after ftrace, in the evolution of Linux tracing/instrumentation.
>
> I know you meant that, but I don't see nor feel it myself. [...]

My position is very simple: right now we have two tracing tools while for many
years we (including you!) always worked hard to have unified infrastructure.

For years ftrace was maintained and pushed upstream optimistically on the
assumption that we are reasonable people who can agree on technical solutions
objectively.

My technical point, at its core, is even simpler:

- If the ftrace UI/API/ABI design is better then perf can be migrated to it
and we can use the ftrace APIs to do more tooling goodness.
Everyone will be happy.

- If the perf UI/API/ABI design is better then ftrace can be migrated to it
and we can use the perf APIs to do more tooling goodness.
Everyone will be happy.

- If we do neither we will have continued tooling badness, tooling pain and
kernel-churn-without-a-clear-purpose. I will be sad.

Call me an egoist but i do not like being sad, i'd like to see one of the
options implemented where everyone is happy! :-)

So we could really have a dedicated tracing tool that can do what ftrace and
perf trace can do and much more. I fully expect that it would have an ftrace
work-alike workflow.

What we do not want is the current nightmarish design and schism, where we have
two different tracers and two different APIs that are really trying to do the
same thing. That has been going on for two and a half years and counting, and i
do not see much progress there, so i'm getting worried about it ...

> [...] Maybe I'm mistaken but I don't have the belief that I can just jump on
> faith into perf and abandon all the work of current ftrace. But I'm happy to
> help unify the kernel infrastructure. That is the important part.

Well, nobody suggests any extreme of immediately 'throwing away everything',
especially as there's no clear replacement. Why would we want to do that?

But at least having a very specific *idea* how to bring the two tracing tools
together quickly, and doing the first steps towards that, after a painful
period of 2.5 years, looks pretty essential to me.

I'd like to see the tracing pond grow, not fragment. Shrinking it by 10%, to
90%, in the first step would still be much better if it can then have the focus
and clarity to grow to 300% or more - as opposed to splitting it into two 50%
parts and seeing both halves rot in their own unique ways! :-)

> > Why not use the correctly designed tracing approach and enhance it, and
> > merge all the remaining useful bits of ftrace into it?
>
> The problem we have is that we disagree on what a correctly designed tracing
> approach is. Tracing is one of those things that everyone has a different
> idea of what is important. As you stated, you do not care about 4 bytes in an
> event. If you have 4 million events that is 4 million bytes. A typical event
> size could be 20 bytes, that 4 bytes is 1/5th of the event that is wasted
> space.

Well, look at the context:

- In the context of useful tools like PowerTop, which is driving *tons* of
useful new code upstream, 4 bytes is very little cost. It strongly filters
events to not be too intrusive to the system to begin with.

- In the context of perf record/report, which easily receives millions of
events, 4 bytes is still not measurable overhead.

- In the context of tracing workflows where you generate hundreds of millions
of events in a short timespan and store the stream as-is as gigabytes of
data, 4 bytes is probably measurable overhead.

So yes, there are definitely contexts/niches where 4 bytes are probably
measurable, but if weighted against the regression of *PowerTop* the cost is
negligible and it's not even a question which way we want to lean.
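
To put rough numbers on those contexts, here is a small sketch of the
arithmetic. The function names are made up for illustration, and the 20-byte
event size is the "typical" figure from the quoted mail, not a measured value.

```c
#include <stdint.h>

/* Total bytes spent on one fixed-size field across a whole trace. */
static inline uint64_t field_overhead_bytes(uint64_t nr_events,
					    uint32_t field_size)
{
	return nr_events * (uint64_t)field_size;
}

/*
 * Share of each event taken up by that field, in whole percent
 * (rounded down). Returns 0 for a zero-sized event to avoid dividing by 0.
 */
static inline unsigned int field_overhead_percent(uint32_t field_size,
						  uint32_t event_size)
{
	if (event_size == 0)
		return 0;
	return (unsigned int)((100ULL * field_size) / event_size);
}
```

For a 20-byte event, field_overhead_percent(4, 20) gives 20, i.e. the 1/5th
mentioned above. For a trace of 200 million events, field_overhead_bytes()
gives 800 million bytes, which is the kind of workload where the field starts
to show up, while a strongly filtered PowerTop session stays far below that.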

Also note that regardless of what tracing will look like in two years' time,
the no-regressions policy will always have *way* higher priority than any
micro-cost concerns.

Note that we are in fact happy that applications use us, and we are *happy*
that they would indeed *break* if we didn't continue to do the goodness that we
are doing today.

Consider the alternative: if we did things that no app and no developer is
interested in. It would just not matter to anyone. We could break it freely,
nobody would give a damn.

So i really prefer the 'apps are using us' situation we are in today. Not
breaking them is a *small* price to pay, and it costs us very little of the
near-infinite degrees of development freedom we still enjoy in the kernel.

Also note that IMO there is really no long-term technical problem: i agree with
you that we can eventually get rid of the 4-byte bkl field as well, if all
affected apps migrate to libperf.so in an orderly fashion.

Thanks,

Ingo