Re: II.2 - Event knowledge missing

From: stephane eranian
Date: Tue Jun 23 2009 - 09:19:06 EST

On Mon, Jun 22, 2009 at 1:57 PM, Ingo Molnar<mingo@xxxxxxx> wrote:
>> 2/ Event knowledge missing
>> There are constraints on events in Intel processors. Different
>> constraints do exist on AMD64 processors, especially with
>> uncore-related events.
> You raise the issue of uncore events in IV.1, but let us reply here
> primarily.
> Un-core counters and events seem to be somewhat un-interesting to
> us. (Patches from those who find them interesting are welcome of
> course!)
That is your opinion, not mine. I believe uncore is useful, though
it is harder to manage than the core PMU. I know that because I have
implemented the support for Nehalem. But going back to our discussion
from December: if it's there, it's because it provides some value-add.
Why would the hardware designers have bothered otherwise?

It is true that if you've only read the uncore description in Volume 3b, it
is not clear what this can actually do. Therefore, I recommend you take
a look at section B.2.5 of the Intel optimization manual:

It shows a bunch of interesting metrics one can collect using uncore.
Metrics which you cannot get any other way. Some people do care
about those, otherwise they would not be explained.

> The main problem with uncore events is that they are per physical
> package, and hence tying a piece of physical metric exposed via them
> to a particular workload is hard - unless full-system analysis is
> performed. 'Task driven' metrics seem far more useful to performance
> analysis (and those are the preferred analysis method of most
> user-space developers), as they allow particularized sampling and
> allow the tight connection between workload and metric.
That is the nature of the beast. There is not much you can do about
this. But it is still useful, especially if you have a symmetrical
workload, as many scientific applications do.

Note that uncore also exists on AMD64, though it is not as clearly
separated. Some events collect at the package level, yet they use core
PMU counters. Those come with restrictions as well; see Section 3.12,
description of PERFCTL, in the BKDG for Family 10h.

> If, despite our expectations, uncore events prove to be useful,
> popular and required elements of performance analysis, they can be
> supported in perfcounters via various levels:
> - a special raw ID range on x86, only to per CPU counters. The
>   low-level implementation reserves the uncore PMCs, so overlapping
>   allocation (and interaction between the cores via the MSRs) is
>   not possible.
I agree this is for CPU counters only, not per-thread. It could be any
core in the package. In fact, multiple per CPU "sessions" could
co-exist in the same package. But there is one difficulty with allowing
this, though. The uncore does not interrupt directly. You need to
designate which core(s) it will interrupt via a bitmask. It could interrupt
ALL CPUs in the package at once (which is another interesting usage
model of uncore). So I believe the choice is between 1 CPU and
all CPUs.
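
To make the "1 CPU or all CPUs" choice concrete, here is a minimal sketch
of building such a bitmask. It assumes that bit i of the mask selects the
i-th core of the package; that layout, the function name and its
parameters are hypothetical, so check the actual register description
before relying on any of it.

```python
def uncore_irq_mask(package_cpus, target=None):
    """Build a bitmask of the cores that receive the uncore interrupt.

    package_cpus: ordered list of CPU ids in one physical package
                  (hypothetical input).
    target:       a CPU id for single-core delivery, or None to
                  interrupt every CPU in the package.
    Assumption: bit i of the result selects package_cpus[i].
    """
    if not package_cpus:
        raise ValueError("package has no CPUs")
    if target is None:
        # Broadcast: one bit set for every core in the package.
        return (1 << len(package_cpus)) - 1
    if target not in package_cpus:
        raise ValueError("target CPU is not in this package")
    # Single delivery: only the bit of the chosen core is set.
    return 1 << package_cpus.index(target)
```

With a four-core package, passing no target sets all four bits, while
passing one CPU id sets only that core's bit.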

Uncore events have no constraints, except for the single fixed counter
event (UNC_CLK_UNHALTED). Thus, you could still use your
event model and overcommit the uncore and multiplex groups on it.
You could reject events in a group once you reach 8 (max number of
counters). I don't see the difference there. The only issue is with managing
the interrupt.
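
The admission and multiplexing logic described above could be sketched
roughly as follows. This is only an illustration: the function names and
the placeholder event names are made up, and it assumes that
UNC_CLK_UNHALTED goes to the fixed counter and does not consume one of
the 8 generic counters.

```python
MAX_UNCORE_GENERIC = 8             # max number of generic uncore counters
FIXED_EVENT = "UNC_CLK_UNHALTED"   # the single fixed-counter event

def admit_group(events):
    """Return True if the whole group fits on the uncore PMU at once."""
    fixed = [e for e in events if e == FIXED_EVENT]
    generic = [e for e in events if e != FIXED_EVENT]
    if len(fixed) > 1:
        # Assumption: only one fixed counter, so only one such event.
        return False
    # Generic events are unconstrained, so only their number matters.
    return len(generic) <= MAX_UNCORE_GENERIC

def group_for_tick(groups, tick):
    """Round-robin multiplexing: pick the group that runs during this tick."""
    return groups[tick % len(groups)]
```

A group of 8 generic events plus UNC_CLK_UNHALTED is accepted, a group
of 9 generic events is rejected, and overcommitted groups simply take
turns on the counters.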

> - generic enumeration with full tooling support, time-sharing and
>   the whole set of features. The low-level backend would time-share
>   the resource between interested CPUs.
> There is no limitation in the perfcounters design that somehow makes
> uncore events harder to support. The uncore counters _themselves_
> are limited to begin with - so rich features cannot be offered on
> top of them.
I would say they are limited. That is simply what you can do given
where they are sourced from.

>> The current code-base does not have any constrained event support,
>> therefore bogus counts may be returned depending on the event
>> measured.
> Then we'll need to grow some when we run into them :-)

FYI, here is the list of constrained events for Intel Core.
Counter [0] means generic counter0, [1] means generic counter1.
If you do not put these events in the right counter, they do
not count what they are supposed to, and do so silently.

Code : 0x10
Counters : [ 0 ]
Desc : Floating point computational micro-ops executed

Code : 0x11
Counters : [ 1 ]
Desc : Floating point assists

Name : MUL
Code : 0x12
Counters : [ 1 ]
Desc : Multiply operations executed

Name : DIV
Code : 0x13
Counters : [ 1 ]
Desc : Divide operations executed

Code : 0x14
Counters : [ 0 ]
Desc : Cycles the divider is busy

Code : 0x18
Counters : [ 0 ]
Desc : Cycles the divider is busy and all other execution units are idle

Code : 0x19
Counters : [ 1 ]
Desc : Delayed bypass
Umask-00 : 0x00 : [FP] : Delayed bypass to FP operation
Umask-01 : 0x01 : [SIMD] : Delayed bypass to SIMD operation
Umask-02 : 0x02 : [LOAD] : Delayed bypass to load operation

Code : 0xcb
Counters : [ 0 ]
Desc : Retired loads that miss the L1 data cache
Umask-00 : 0x01 : [L1D_MISS] : Retired loads that miss the L1 data cache (precise event)
Umask-01 : 0x02 : [L1D_LINE_MISS] : L1 data cache line missed by retired loads (precise event)
Umask-02 : 0x04 : [L2_MISS] : Retired loads that miss the L2 cache (precise event)
Umask-03 : 0x08 : [L2_LINE_MISS] : L2 cache line missed by retired loads (precise event)
Umask-04 : 0x10 : [DTLB_MISS] : Retired loads that miss the DTLB (precise event)
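
As an illustration of what constrained event support might look like,
here is a minimal sketch that encodes the list above as a table and
places a group of events on counters. It assumes two generic counters
(counter0 and counter1) and treats any event code not in the table as
unconstrained; the function names and the 0x3c example code are
hypothetical, and this is not the perfcounters implementation.

```python
NUM_GENERIC = 2  # assumption: generic counter0 and counter1 only

# Event code -> set of generic counters it may use (from the list above).
CONSTRAINTS = {
    0x10: {0}, 0x11: {1}, 0x12: {1}, 0x13: {1},
    0x14: {0}, 0x18: {0}, 0x19: {1}, 0xcb: {0},
}

def allowed_counters(code):
    """Counters an event may use; unlisted codes may use any counter."""
    return CONSTRAINTS.get(code, set(range(NUM_GENERIC)))

def assign(codes):
    """Place each event on a counter, or return None if the group cannot fit.

    Most-constrained events are placed first, each on its lowest free
    allowed counter. That greedy order is enough here because every event
    allows either exactly one counter or all of them; a more general
    constraint table would need backtracking.
    """
    if len(codes) > NUM_GENERIC:
        return None
    order = sorted(range(len(codes)),
                   key=lambda i: len(allowed_counters(codes[i])))
    free = set(range(NUM_GENERIC))
    placement = [None] * len(codes)
    for i in order:
        choices = sorted(allowed_counters(codes[i]) & free)
        if not choices:
            return None  # reject instead of silently miscounting
        placement[i] = choices[0]
        free.remove(choices[0])
    return placement  # placement[i] is the counter for codes[i]
```

For example, MUL (0x12) and event 0xcb can be measured together, whereas
0x10 and 0x14 both need counter0 and so cannot share a group.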