comments on Performance Counters for Linux (PCL)

From: stephane eranian
Date: Thu May 28 2009 - 10:58:57 EST


The following sections are some preliminary comments concerning the
Performance Counters for Linux (PCL) API and implementation proposal
currently in development.


I/ General API comments

 1/ Data structures

  * struct perf_counter_hw_event

  - I think this structure will also be used to enable non-counting
    features, e.g., IBS. The name is awkward: it is also used to enable
    non-hardware (SW) events. Why not call it struct perf_event?

  - uint64_t config

    Why use a single field to encode both the event type and its
    encoding? By design, the syscall is not in the critical path. Why
    not spell things out clearly: int type, uint64_t code.

  - uint64_t irq_period

    IRQ is an x86-related name. Why not use smpl_period instead?

  - uint32_t record_type

    This field is a bitmask. I believe 32 bits is too small to
    accommodate future record formats.

  - uint32_t read_format


  - uint64_t nmi

    This is an x86-only feature. Why make it visible in a generic API?

    What is its semantic? Can one counter use NMI while another does
    not, or are you planning on switching the interrupt vector when you
    change event groups?

    Why do I need to be a privileged user to enable NMI? Especially
    given that:
        - non-privileged users can monitor at privilege level 0 (kernel)
        - there is interrupt throttling

  - uint64_t exclude_*

    It seems those fields were added to support the generic HW events.
    But I think they are confusing and their semantic is not quite
    clear.

    Furthermore, aren't they irrelevant for SW events?

    What is the meaning of exclude_user? Which priv levels are actually
    excluded? Take Itanium: it has 4 priv levels and the PMU counters
    can monitor at any priv level or combination thereof.

    When programming raw HW events, the priv level filtering is
    typically already included in the event code. Which setting has
    priority, the raw encoding or the exclude_* fields?

    Looking at the existing x86 implementation, it seems exclude_* can
    override whatever is set in the raw event code.

    For any event, but in particular SW events, why not encode this in
    the config field, as is done for a raw HW event?

  - mmap, munmap, comm

    It is not clear to me why those fields are defined here rather than
    as PERF_RECORD_* types. They are stored in the event buffer only,
    and they are only useful when sampling.

    It is also not clear why mmap and munmap are separate options. What
    is the point of a munmap-only notification?

  * enum perf_event_types vs. enum perf_event_type

    The two names are too close to each other, yet they describe
    unrelated data structures. This is very confusing.

  * struct perf_counter_mmap_page

    The definition of data_head precludes sampling buffers bigger than
    4GB. Does that make sense on TB machines?

    Given there is only one counter per page, an awful lot of precious
    RLIMIT_MEMLOCK space is wasted on this.

    Typically, if you are self-sampling, you are not going to read the
    current value of the sampling period. That re-mapping trick is only
    useful when counting.

    Why not make these two separate mappings (using the mmap offset as
    the indicator)?

    With this approach, you would get one page back per sampling
    period, and that page could then be used for the actual samples.

 2/ System calls

  * ioctl()

    You have defined 3 ioctls() so far to operate on an existing event.
    I was under the impression that ioctl() should not be used except
    for device drivers.
  * prctl()

    The API is event-based. Each event gets a file descriptor, so
    events are managed individually. Thus, to enable/disable, you need
    to enable/disable each one separately.

    The use of prctl() breaks this design choice. It is not clear what
    you are actually enabling. It looks like you are enabling all the
    counters attached to the thread. This is incorrect. With your
    implementation, the PMU can be shared between competing users. In
    particular, multiple tools may be monitoring the same thread. Now
    imagine a tool monitoring a self-monitoring thread which happens to
    start/stop its measurement using prctl(). That would also
    start/stop the measurement of the external tool. I have verified
    that this is what actually happens.

    I believe this call is bogus and should be eliminated. The
    interface exposes events individually, therefore they should be
    controlled individually.

 3/ Counter width

    It is not clear whether or not the API exposes counters as 64-bit
    wide on PMUs which do not implement 64-bit wide counters.

    Both irq_period and read() use 64-bit integers. However, it appears
    that the implementation is not using all the bits. In fact, on x86,
    it appears the irq_period is silently truncated. I believe this is
    not correct. If the period is not valid, an error should be
    returned. Otherwise, the tool will be getting samples at a rate
    different from what it requested.

    I would assume that on the read() side, counts are accumulated as
    64-bit integers. But if that is the case, then there is an
    asymmetry between periods and counts.

    Given that your API is high-level, I don't think tools should have
    to worry about the actual width of a counter. This is especially
    true because they don't know which counter the event is going to go
    into, and, if I recall correctly, on some PMU models different
    counters can have different widths (Power, I think).

    It is rather convenient for tools to always manipulate counters as
    64-bit integers. You should provide a consistent view between
    counts and periods.

 4/ Grouping

    By design, an event can only be part of one group at a time, and
    events in a group are guaranteed to be active on the PMU at the
    same time. That means a group cannot have more events than there
    are counters available on the PMU. Tools may want to know the
    number of available counters in order to group their events
    accordingly, such that reliable ratios can be computed. It seems
    the only way to find out is by trial and error. This is not
    practical.

 5/ Multiplexing and scaling

    The PMU can be shared by multiple programs, each controlling a
    variable number of events. Multiplexing occurs by default unless
    pinned is requested. The exclusive option only guarantees that the
    group does not share the PMU with other groups while it is active,
    at least that is my understanding.

    By default, you may be multiplexed, and if that happens you cannot
    know unless you request the timing information as part of the
    read_format. Without it, if multiplexing has occurred, bogus counts
    may be returned with no indication whatsoever.

    To avoid returning misleading information, it seems like the API
    should refuse to open a non-pinned event which does not request the
    timing information in its read_format. This would avoid a lot of
    confusion down the road.

 7/ Multiplexing and system-wide

    Multiplexing is time-based and is hooked into the timer tick. At
    every tick, the kernel tries to schedule another group of events.

    In tickless kernels, if a CPU is idle, no timer tick is generated,
    and therefore no multiplexing occurs. This is incorrect. Just
    because the CPU is idle does not mean there are no interesting PMU
    events to measure. Parts of the CPU may still be active, e.g.,
    caches and buses. Thus, it is expected that multiplexing still
    happens.

    You need to hook the timer source for multiplexing up to something
    else which is not affected by tickless.

 8/ Controlling group multiplexing

    Although multiplexing is somewhat exposed to the user via the
    timing information, I believe there is not enough control. I know
    of advanced monitoring tools which need to measure over a dozen
    events in one monitoring session. Given that the underlying PMU
    does not have enough counters, or that certain events cannot be
    measured together, it is necessary to split the events into groups
    and multiplex them. Events are not grouped at random, and groups
    are not ordered at random either. The sequence of groups is
    carefully chosen such that related events are in neighboring
    groups, so that they measure similar parts of the execution. This
    way you can mitigate the fluctuations introduced by multiplexing
    and compare ratios. In other words, some tools may want to control
    the order in which groups are scheduled on the PMU.

    The exclusive flag ensures correct grouping, but there is nothing
    to control the ordering of groups. That is a problem for some
    tools. Groups from different 'sessions' may be interleaved, which
    breaks the continuity of the measurement.

    The group ordering has to be controllable by the tools OR must be
    fully specified by the API; it should not be a property of the
    implementation. The API could, for instance, specify that groups
    are scheduled in increasing order of the group leaders' file
    descriptors. Either way, there needs to be some way of preventing
    the interleaving of groups from different 'sessions'.

 9/ Event buffer

    There is a kernel-level event buffer which can be re-mapped
    read-only at the user level via mmap(). The buffer must be a
    multiple of the page size and must be at least 2 pages long. The
    first page is used for the counter re-mapping and the buffer
    header, the second for the actual event samples.

    The buffer is managed as a cyclic buffer, which means there is a
    continuous race between the tool and the kernel. The tool must
    parse the buffer faster than the kernel can fill it. It is
    important to realize that the race continues even when monitoring
    is stopped, as non-PMU-based records keep being stored, such as
    mmap and munmap events. This is expected, because if mapping
    information were lost, samples could be correlated incorrectly.

    However, there is currently no reliable way of figuring out whether
    or not the buffer has wrapped around since the last scan by the
    tool. Just checking the current position or estimating the space
    left is not good enough. There ought to be an overflow counter of
    some sort indicating the number of times the head wrapped around.

 10/ Group event buffer entry

    This is activated by setting PERF_RECORD_GROUP in the record_type
    field. With this bit set, the values of the other members of the
    group are stored sequentially in the buffer. To help figure out
    which value corresponds to which event, the current implementation
    also stores the raw encoding of each event.

    The event encoding does not help figure out which event a value
    refers to: there can be multiple events with the same code. It also
    does not fit the API model, where events are identified by file
    descriptors.

    The file descriptor should be provided, not the raw encoding.

 11/ reserve_percpu

    There is more than just counters on many PMU models, and counters
    are not symmetrical, even on x86.

    What does this API actually guarantee in terms of which events a
    tool will be able to measure with the reserved counters?

II/ X86 comments

 Mostly implementation-related comments in this section.

 1/ Fixed counter and event on Intel

    You cannot simply fall back to a generic counter if you cannot get
    a fixed counter. There are model-specific bugs. For instance,
    UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing
    on Nehalem when it is used in fixed counter 2 versus a generic
    counter. The same is true on Core.

    You cannot simply look at the event field code to determine whether
    an event is supported by a fixed counter. You must look at the
    other fields, such as edge, invert, and cnt-mask. If those are
    present, then you have to fall back to using a generic counter, as
    fixed counters only support priv level filtering. As indicated
    above, though, programming UNHALTED_REFERENCE_CYCLES on a generic
    counter does not count the same thing, therefore you need to fail
    if filters other than priv levels are present on this event.

 2/ Event knowledge missing

    There are constraints and bugs on some events on Intel Core and
    Nehalem. In your model, those need to be taken care of by the
    kernel. Should the kernel make the wrong decision, there would be
    no work-around for user tools. Take the example I outlined just
    above with Intel fixed counters.

    Constraints exist on AMD64 processors as well.

 3/ Interrupt throttling

    There is apparently no way for a system admin to set the threshold;
    it is hardcoded.

    Throttling occurs without the tool(s) knowing. I think this is a
    problem.

 4/ NMI

    Why restrict NMI to privileged users when you have throttling to
    protect against interrupt flooding?

    Are you trying to prevent non-privileged users from getting samples
    inside kernel critical sections?

III/ Requests

 1/ Sampling period change

    As it stands today, it seems there is no way to change a period
    other than to close() the event file descriptor and start over.
    When you close the group leader, it is not clear to me what happens
    to the remaining events.

    I know of tools which want to adjust the sampling period based on
    the number of samples they get per second.

    By design, your perf_counter_open() should not really be in the
    critical path, e.g., when you are processing samples from the event
    buffer. Thus, I think it would be good to have a dedicated call to
    allow changing the period.

 2/ Sampling period randomization

    It is our experience (on Itanium, for instance) that for certain
    sampling measurements it is beneficial to randomize the sampling
    period a bit. This is in particular the case when sampling on an
    event that happens very frequently and which is not related to
    timing, e.g., branch_instructions_retired. Randomization helps
    mitigate the bias. You do not need anything sophisticated, but when
    you are using a kernel-level sampling buffer, you need to have the
    kernel do the randomization. Randomization needs to be supported
    per event.

 3/ Group multiplexing ordering

    As mentioned above, the ordering of group multiplexing for one
    process needs to be either specified by the API or controllable by
    users.

IV/ Open questions

 1/ Support for model-specific uncore PMU monitoring capabilities

    Recent processors have multiple PMUs: typically one per core, but
    also one at the socket level, e.g., on Intel Nehalem. It is
    expected that this API will provide access to these PMUs as well.

    It seems that with the current API, raw events for those PMUs would
    need a new architecture-specific type, as the event encoding by
    itself may not be enough to disambiguate between a core and an
    uncore PMU event.

    How are those events going to be supported?

 2/ Features impacting all counters

    On some PMU models, e.g., Itanium, there are certain features which
    influence all active counters. For instance, there is a way to
    restrict monitoring to a range of contiguous code or data addresses
    using both PMU registers and the debug registers.

    Given that the API exposes events (counters) as independent of each
    other, I wonder how range restriction could be implemented.

    Similarly, on Itanium, there are global behaviors. For instance, on
    counter overflow the entire PMU freezes all at once. That seems
    contradictory with the design of the API, which creates the
    illusion of independent counters.

    What solutions do you propose?


 3/ AMD IBS

    How is AMD IBS going to be implemented?

    IBS has two separate sets of registers: one to capture fetch
    related data and another to capture instruction execution data. For
    each, there is one config register but multiple data registers. In
    each mode, there is a specific sampling period and IBS can
    interrupt.

    It looks like you could define two pseudo-events or event types and
    then define a new record_format and read_format. Those formats
    would only be valid for an IBS event.

    Is that how you intend to support IBS?

 4/ Intel PEBS

    Since Netburst-based processors, Intel PMUs have supported a
    hardware sampling buffer mechanism called PEBS. PEBS really became
    useful with Nehalem.

    Not all events support PEBS. Up until Nehalem, only one counter
    supported PEBS (PMC0). The format of the hardware buffer changed
    between Core and Nehalem. It is not yet architected, thus it can
    still evolve with future PMU models.

    On Nehalem, there is a new PEBS-based feature called Load Latency
    Filtering which captures where data cache misses occur (similar to
    Itanium D-EAR). Activating this feature requires setting a latency
    threshold hosted in a separate PMU MSR.

    On Nehalem, given that all 4 generic counters support PEBS, the
    sampling buffer may contain samples generated by any of the 4
    counters. The buffer includes a bitmask of registers to determine
    the source of each sample. Multiple bits may be set in the bitmask.

    How will PEBS be supported by this new API?

 5/ Intel Last Branch Record (LBR)

    Intel processors since Netburst have had a cyclic buffer hosted in
    registers which can record taken branches. Each taken branch is
    stored in a pair of LBR registers (source, destination). Up until
    Nehalem, there were no filtering capabilities for LBR. LBR is not
    an architected PMU feature.

    There is no counter associated with LBR. Nehalem has an LBR_SELECT
    MSR; however, there are some constraints on it given that it is
    shared between the hardware threads.

    LBR is only useful when sampling and therefore must be combined
    with a counter. LBR must also be configured to freeze on PMU
    interrupt.

    How is LBR going to be supported?
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx