Re: [generalized cache events] Re: [PATCH 1/1] perf tools: Add missing user space support for config1/config2

From: Ingo Molnar
Date: Fri Apr 22 2011 - 09:19:15 EST



* Stephane Eranian <eranian@xxxxxxxxxx> wrote:

> > Say i'm a developer and i have an app with code like this:
> >
> > #define THOUSAND 1000
> >
> > static char array[THOUSAND][THOUSAND];
> >
> > int init_array(void)
> > {
> >        int i, j;
> >
> >        for (i = 0; i < THOUSAND; i++) {
> >                for (j = 0; j < THOUSAND; j++) {
> >                        array[j][i]++;
> >                }
> >        }
> >
> >        return 0;
> > }
> >
> > Pretty common stuff, right?
> >
> > Using the generalized cache events i can run:
> >
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> >  Performance counter stats for './array' (10 runs):
> >
> >         6,719,130 cycles:u                   ( +-   0.662% )
> >         5,084,792 instructions:u           #      0.757 IPC     ( +-   0.000% )
> >         1,037,032 l1-dcache-loads:u          ( +-   0.009% )
> >         1,003,604 l1-dcache-load-misses:u    ( +-   0.003% )
> >
> >        0.003802098  seconds time elapsed   ( +-  13.395% )
> >
> > I consider this 'bad', because for almost every dcache-load there's a
> > dcache-miss - a ~97% L1 cache miss rate!
> >
> > Then i think a bit, notice something, apply this performance optimization:
>
> I don't think this example is really representative of the kind of problems
> people face; it is just too small and obvious. [...]

Well, the overwhelming majority of performance problems are 'small and obvious'
- once a tool roughly pinpoints their existence and location!

And you have not offered a counterexample either, so you have not really
demonstrated what you consider a 'real' example, or why you consider
generalized cache events inadequate.

> [...] So I would not generalize on it.

To the contrary, it demonstrates the most fundamental concept of cache
profiling: looking at the hits/misses ratios and identifying hotspots.

That concept can be applied pretty nicely to all sorts of applications.

Interestingly, the exact hardware event doesn't even *matter* for most
problems, as long as it *correlates* with the conceptual entity we want to
measure.

So what we need are hardware events that correlate with:

- loads done
- stores done
- load misses suffered
- store misses suffered
- branches done
- branches missed
- instructions executed

It is the *ratio* that matters in most cases: before-change versus
after-change, hits versus misses, etc.

Yes, there will be imprecisions, CPU quirks, limitations and speculation
effects - but as long as we keep our eyes on the ball, generalizations are
useful for solving practical problems.
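
To show how simple keeping an eye on that ratio can be, here is a minimal
sketch of reading the L1-data read and read-miss counts directly via the
perf_event_open() syscall - not taken from any existing tool, error handling
stripped, and it assumes it gets linked together with the init_array()
example above:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

extern int init_array(void);            /* the workload from array.c above */

static int open_l1d_read_event(uint64_t result)
{
        struct perf_event_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.size           = sizeof(attr);
        attr.type           = PERF_TYPE_HW_CACHE;
        attr.config         = PERF_COUNT_HW_CACHE_L1D |
                              (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                              (result << 16);
        attr.disabled       = 1;
        attr.exclude_kernel = 1;        /* user space only, like :u */

        /* pid 0, cpu -1: count this thread on whatever CPU it runs on */
        return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
        int loads  = open_l1d_read_event(PERF_COUNT_HW_CACHE_RESULT_ACCESS);
        int misses = open_l1d_read_event(PERF_COUNT_HW_CACHE_RESULT_MISS);
        uint64_t nr_loads = 0, nr_misses = 0;

        ioctl(loads,  PERF_EVENT_IOC_ENABLE, 0);
        ioctl(misses, PERF_EVENT_IOC_ENABLE, 0);

        init_array();

        ioctl(loads,  PERF_EVENT_IOC_DISABLE, 0);
        ioctl(misses, PERF_EVENT_IOC_DISABLE, 0);

        read(loads,  &nr_loads,  sizeof(nr_loads));
        read(misses, &nr_misses, sizeof(nr_misses));

        printf("L1d loads: %llu  misses: %llu  miss ratio: %.1f%%\n",
               (unsigned long long)nr_loads,
               (unsigned long long)nr_misses,
               nr_loads ? 100.0 * nr_misses / nr_loads : 0.0);

        return 0;
}

The two numbers it prints are the same two columns that
'perf stat -e l1-dcache-loads:u -e l1-dcache-load-misses:u' shows - and
neither form requires knowing the CPU model or its raw event encodings.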

> If you are happy with generalized cache events then, as I said, I am fine
> with it. But the API should ALWAYS allow users access to raw events when they
> need finer-grained analysis.

Well, that's a pretty far cry from calling it a 'myth' :-)

So my point is (outlined in detail in the common changelog) that we need sane
generalized remote DRAM events *first* - before we think about exposing the
'rest' of the offcore PMU as raw events.

> > diff --git a/array.c b/array.c
> > index 4758d9a..d3f7037 100644
> > --- a/array.c
> > +++ b/array.c
> > @@ -9,7 +9,7 @@ int init_array(void)
> >
> >        for (i = 0; i < THOUSAND; i++) {
> >                for (j = 0; j < THOUSAND; j++) {
> > -                       array[j][i]++;
> > +                       array[i][j]++;
> >                }
> >        }
> >
> > I re-run perf-stat:
> >
> >  $ perf stat --repeat 10 -e cycles:u -e instructions:u -e l1-dcache-loads:u -e l1-dcache-load-misses:u ./array
> >
> >  Performance counter stats for './array' (10 runs):
> >
> >         2,395,407 cycles:u                   ( +-   0.365% )
> >         5,084,788 instructions:u           #      2.123 IPC     ( +-   0.000% )
> >         1,035,731 l1-dcache-loads:u          ( +-   0.006% )
> >             3,955 l1-dcache-load-misses:u    ( +-   4.872% )
> >
> >  - I got absolute numbers in the right ballpark: i got a million loads as
> >   expected (the array has 1 million elements), and 1 million cache-misses in
> >   the 'bad' case.
> >
> >  - I did not care which specific Intel CPU model this was running on
> >
> >  - I did not care about *any* microarchitectural details - i only knew it's a
> >   reasonably modern CPU with caching
> >
> >  - I did not care how i could get access to L1 load and miss events. The events
> >   were named obviously and it just worked.
> >
> > So no, kernel-driven generalization and sane tooling are not at all a 'myth'
> > today, really.
> >
> > So this is the general direction in which we want to move. If you know about
> > problems with the existing generalization definitions then let's *fix* them,
> > not pretend that generalizations and sane workflows are impossible ...
>
> Again, to fix them, you need to give us definitions for what you expect those
> events to count. Otherwise we cannot make forward progress.

No, we do not 'need' to give exact definitions. This whole topic is more
analogous to physics than to mathematics. See my description above of how
ratios and high-level structure matter more than absolute values and
definitions.

Yes, where possible 'loads' and 'stores' should correspond to the number of
loads and stores the program flow does - which you can get by looking at the
assembly code. 'Instructions' should correspond to the number of instructions
executed.
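
For instance, in the init_array() example above the expected load count can be
read straight off the source:

        for (i = 0; i < THOUSAND; i++) {
                for (j = 0; j < THOUSAND; j++) {
                        /* one load + one store per iteration */
                        array[i][j]++;
                }
        }

        /*
         * THOUSAND * THOUSAND iterations => ~1,000,000 loads, which is
         * exactly the ballpark of the ~1.03 million l1-dcache-loads:u
         * measured above.
         */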

If the CPU cannot do that, it's not a huge deal in practice - we will cope, and
hopefully it will all be fixed in future CPU versions.

That said, most CPUs i have access to get the fundamentals right, so it's not
like we have huge problems in practice. Key CPU statistics are available.

> Let me give just one simple example: cycles
>
> What is your definition for the generic cycle event?
>
> There are various flavors:
> - count halted, unhalted cycles?

Again i think you are getting lost in too much detail.

For typical developers, halted versus unhalted is mostly an uninteresting
distinction: people tend to just type 'perf record ./myapp', which is
per-workload profiling and thus excludes idle time. So it would give them the
same result regardless of whether halted or unhalted cycles are counted.

( This simple example already shows the idiocy of the hardware event names,
like calling a cycles event "CPU_CLK_UNHALTED.REF". In most cases the developer
does *not* care about those distinctions, so the defaults should not be
complicated with them. )

> - impacted by frequency scaling?

The best default for developers is a frequency-scaling-invariant result - i.e.
one counted not against a reference clock but against the real CPU clock.

( Even that will not be completely invariant, due to the frequency-scaling
dependent cost of misses and bus ops, etc. )

But profiling against a reference frequency makes sense as well, especially for
system-wide profiling - this is the hardware equivalent of the cpu-clock /
elapsed time metric. We could implement the cpu-clock using reference cycles
events for example.
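
As a sketch of what that could look like at the perf_event_attr level - with
the caveat that no generalized reference-cycles event exists today, so the id
below is purely a placeholder:

#include <string.h>
#include <linux/perf_event.h>

/* Placeholder id for a generalized reference-cycles event - not part of
 * the current ABI, used here only to illustrate the contrast. */
#define HYPOTHETICAL_REF_CYCLES         PERF_COUNT_HW_MAX

static void init_cycles_attr(struct perf_event_attr *attr, int want_ref_clock)
{
        memset(attr, 0, sizeof(*attr));
        attr->size = sizeof(*attr);
        attr->type = PERF_TYPE_HARDWARE;

        /*
         * PERF_COUNT_HW_CPU_CYCLES ticks with the real (scaled) CPU clock;
         * a reference-cycles event would tick at a constant rate, so
         * dividing its count by the nominal frequency gives an
         * elapsed-time-like value - i.e. a hardware-based cpu-clock.
         */
        attr->config = want_ref_clock ? HYPOTHETICAL_REF_CYCLES
                                      : PERF_COUNT_HW_CPU_CYCLES;
}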

> LLC-misses:
> - what is considered the LLC?

The last level cache is whichever cache sits before DRAM.

> - does it include code, data or both?

Both if possible as they tend to be unified caches anyway.

> - does it include demand, hw prefetch?

Do you mean for the LLC-prefetch events? What would your suggestion be - which
metric is the most useful? Prefetches are not directly issued by program logic,
so this is borderline. We wanted to include them for completeness - and the
metric should probably cover 'all activities that the program flow has not
caused directly and which may be sucking up system resources' - i.e. including
hw prefetch.

> - is it to local or remote DRAM?

The current definitions should include both.

Measuring remote DRAM accesses is of course useful - that is the original point
of this thread. It should be done as an additional layer: basically, local RAM
is yet another cache level - but we can take other generalized approaches as
well, if they make more sense.
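
To illustrate that direction: a generalized 'node' cache id would slot straight
into the existing cache-event encoding. This is a hypothetical sketch - the id
below is a placeholder, not part of the current ABI:

#include <linux/perf_event.h>

/* Hypothetical 'node' cache id: local DRAM viewed as one more cache level
 * in front of remote DRAM. Placeholder value - not part of the current ABI. */
#define HYPOTHETICAL_CACHE_NODE         PERF_COUNT_HW_CACHE_MAX

/* A read that misses the local node (i.e. goes to remote DRAM) would be
 * encoded exactly like any other generalized cache event: */
static __u64 node_read_miss_config(void)
{
        return HYPOTHETICAL_CACHE_NODE |
               (PERF_COUNT_HW_CACHE_OP_READ     <<  8) |
               (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
}

Tooling could then expose it under an obvious name, in the same style as the
existing l1-dcache-* events.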

Thanks,

Ingo