Re: [PATCH] perf events: Add stalled cycles generic event - PERF_COUNT_HW_STALLED_CYCLES

From: Ingo Molnar
Date: Wed Apr 27 2011 - 11:48:48 EST



* Arun Sharma <arun@xxxxxxxxxxxxxxx> wrote:

> On Wed, Apr 27, 2011 at 4:11 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
> > As for the first, 'overview' step, I'd like to use one or two numbers only, to
> > give people a general ballpark figure about how well the CPU is performing for
> > a given workload.
> >
> > Wouldn't UOPS_EXECUTED.CORE_ACTIVE_CYCLES,c=1,i=1 be in general a pretty good,
> > primary "stall" indicator? This is similar to the "cycles-uops_executed" value
> > in your script (UOPS_EXECUTED:PORT015:t=1 and UOPS_EXECUTED:PORT234_CORE
> > based): it counts cycles when there's no execution at all - not even
> > speculative execution.
>
> If we're going to pick one stall indicator, [...]

Well, one stall indicator for the 'general overview' stage, plus branch misses.

Other stages can also have all sorts of details, including various subsets of
stall reasons. (and stalls of different units of the CPU)

We'll see how far it can be pushed.

> [...] why not pick cycles where no uops are retiring?
>
> cycles_no_uops_retired = cycles - c["UOPS_RETIRED:ANY:c=1:t=1"]
>
> In the presence of C-states and some halted cycles, I found that I couldn't
> measure it via UOPS_RETIRED:ANY:c=1:i=1 because it counts halted cycles too
> and could be greater than (unhalted) cycles.

Agreed, good point.

You are right that it is more robust to pick a 'the CPU was busy on our behalf'
metric instead of a 'CPU is idle' metric, because that way 'HLT', as a special
type of idling, does not have to be identified.

HLT is not an issue for the default 'perf stat' behavior (because it only
measures task execution, never the idle thread or other tasks not involved with
the workload), but for per CPU and system-wide (--all) it matters.

I'll flip it around.
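
To make the flipped formulation concrete, here is a minimal Python sketch. It
assumes raw counter values are held in a dictionary keyed by event name, as in
the c[...] notation quoted above. The function name, the "cycles" key and the
sample numbers are hypothetical, not taken from the script or from perf itself.

```python
def stalled_cycles(counts, busy_event="UOPS_RETIRED:ANY:c=1:t=1"):
    """Return (stalled_cycles, stalled_fraction) from raw counter values.

    counts maps event names to raw counts. "cycles" is assumed to hold
    unhalted cycles; busy_event counts cycles in which at least one uop
    retired (the 'CPU was busy on our behalf' metric).
    """
    cycles = counts["cycles"]
    busy = counts[busy_event]
    # Clamp at zero in case the busy count ever exceeds the cycle count.
    stalled = max(cycles - busy, 0)
    fraction = stalled / cycles if cycles else 0.0
    return stalled, fraction


# Made-up sample values, for illustration only.
sample = {"cycles": 1_000_000, "UOPS_RETIRED:ANY:c=1:t=1": 650_000}
stalled, fraction = stalled_cycles(sample)
print(f"stalled cycles: {stalled} ({fraction:.1%} of cycles)")
```

Because both inputs only count while the CPU is running, halted cycles never
enter the subtraction, which is the robustness argument made above.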

> The other issue I had to deal with was UOPS_RETIRED > UOPS_EXECUTED
> condition. I believe this is caused by what AMD calls sideband stack
> optimizer and Intel calls dedicated stack manager (i.e. UOPS executed outside
> the main pipeline). A recursive fibonacci(30) is a good test case for
> reproducing this.

So the PORT015+234 sum is not precise? The definition seems to be rather firm:

Counts number of Uops executed that were issued on port 2, 3, or 4.
Counts number of Uops executed that were issued on port 0, 1, or 5.

Wouldn't that include all uops?
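
One way to check this on a given workload would be to compare the two port
counts against a retired-uops count. A minimal sketch follows, again assuming a
dictionary of raw counts. The two UOPS_EXECUTED names are the ones from the
script; the "UOPS_RETIRED:ANY" key, the function name and the sample values are
assumptions made for illustration.

```python
def check_uop_accounting(counts):
    """Compare uops executed on ports 0-5 against uops retired.

    Returns (executed, retired, anomaly), where anomaly is True when more
    uops retired than were counted as executed on the ports.
    """
    executed = (counts["UOPS_EXECUTED:PORT015:t=1"]
                + counts["UOPS_EXECUTED:PORT234_CORE"])
    retired = counts["UOPS_RETIRED:ANY"]  # assumed key name
    return executed, retired, retired > executed


# Made-up sample values, for illustration only.
sample = {
    "UOPS_EXECUTED:PORT015:t=1": 700_000,
    "UOPS_EXECUTED:PORT234_CORE": 250_000,
    "UOPS_RETIRED:ANY": 1_000_000,
}
executed, retired, anomaly = check_uop_accounting(sample)
print(f"executed={executed} retired={retired} retired>executed: {anomaly}")
```

If the anomaly shows up on a stack-heavy test such as the recursive
fibonacci(30) case, that would point at uops the port events do not see.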

> > Is this the direction you'd like to see perf stat move in? Any
> > comments, suggestions?
>
> Looks like a step in the right direction. Thanks.

Ok, great - will keep you updated. I doubt the defaults can ever beat truly
expert use of PMU events: there will always be fine details that a generic
approach will miss. But I'd be happy if we got 70% of the way ...

Thanks,

Ingo