comments on Performance Counters for Linux (PCL)
From: stephane eranian
Date: Thu May 28 2009 - 10:58:57 EST
The following sections contain some preliminary comments concerning the
Performance Counters for Linux (PCL) API and implementation proposal
currently in development.
I/ General API comments
  1/ Data structures
    * struct perf_counter_hw_event
    - I think this structure will also be used to enable non-counting
      features, e.g., IBS. The name is awkward: it is already used to
      enable non-hardware (SW) events. Why not call it struct perf_event?
    - uint64_t config
      Why use a single field to encode both the event type and its code?
      By design, the syscall is not in the critical path. Why not spell
      things out clearly: int type, uint64_t code.
    - uint64_t irq_period
      IRQ is an x86-related name. Why not use smpl_period instead?
    - uint32_t record_type
      This field is a bitmask. I believe 32 bits is too small to
      accommodate future record formats.
    - uint32_t read_format
      Ditto.
    - uint64_t nmi
      This is an x86-only feature. Why make it visible in a generic API?
      What are its semantics? Can I have one counter use NMI and another
      not, or are you planning on switching the interrupt vector when you
      change event groups?
      Why do I need to be a privileged user to enable NMI? Especially
      given that:
        - non-privileged users can monitor at privilege level 0 (kernel).
        - there is interrupt throttling.
    - uint64_t exclude_*
      It seems these fields were added to support the generic HW events,
      but I find them confusing and their semantics unclear. Furthermore,
      aren't they irrelevant for SW events?
      What is the meaning of exclude_user? Which privilege levels are
      actually excluded? Take Itanium: it has 4 privilege levels and the
      PMU counters can monitor at any privilege level or combination
      thereof.
      When programming raw HW events, the privilege-level filtering is
      typically already included in the encoding. Which setting has
      priority, the raw encoding or the exclude_* fields? Looking at the
      existing x86 implementation, it seems exclude_* can override
      whatever is set in the raw event code.
      For any event, but for SW events in particular, why not encode this
      in the config field, as is done for a raw HW event?
    - mmap, munmap, comm
      It is not clear to me why these fields are defined here rather than
      as PERF_RECORD_* values. They are stored in the event buffer only,
      and they are only useful when sampling.
      It is also not clear why mmap and munmap are separate options.
      What is the point of munmap-only notification?
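      To illustrate the type/code split suggested above, here is a sketch
      (all names hypothetical, not part of the proposal): the two fields
      are spelled out, and a trivial helper shows that a packed word can
      still be derived whenever a compact internal encoding is wanted.

```c
#include <stdint.h>

/* Hypothetical layout: spell the event out instead of packing it into a
 * single 64-bit config word. All names are illustrative only. */
struct perf_event_attr_sketch {
    int      type;        /* e.g., raw HW, generic HW, SW, tracepoint */
    uint64_t code;        /* event encoding within that type */
    uint64_t smpl_period; /* arch-neutral name instead of irq_period */
};

/* If a packed word is ever needed internally, the split costs nothing:
 * pack on entry, once, outside any critical path. */
static inline uint64_t pack_config(int type, uint64_t code)
{
    return ((uint64_t)type << 56) | (code & ((1ULL << 56) - 1));
}

static inline int config_type(uint64_t config)
{
    return (int)(config >> 56);
}

static inline uint64_t config_code(uint64_t config)
{
    return config & ((1ULL << 56) - 1);
}
```

      Since the open syscall is off the critical path, the cost of the
      extra field is irrelevant; clarity wins.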
    * enum perf_event_types vs. enum perf_event_type
      The two names are too close to each other, yet they define
      unrelated data structures. This is very confusing.
    * struct perf_counter_mmap_page
      The definition of data_head precludes sampling buffers bigger than
      4GB. Does that make sense on TB machines?
      Given that there is only one counter per page, an awful lot of
      precious RLIMIT_MEMLOCK space is wasted on this. Typically, if you
      are self-sampling, you are not going to read the current value of
      the sampling period: that re-mapping trick is only useful when
      counting.
      Why not make these two separate mappings (using the mmap offset as
      the indicator)? With this approach, you would get one page back per
      sampling buffer, and that page could then be used for the actual
      samples.
  2/ System calls
    * ioctl()
      You have defined 3 ioctls so far to operate on an existing event.
      I was under the impression that ioctl() should not be used except
      for drivers.
    * prctl()
      The API is event-based: each event gets a file descriptor, so
      events are managed individually. Thus, to enable/disable, you need
      to enable/disable each one separately.
      The use of prctl() breaks this design choice. It is not clear what
      you are actually enabling. It looks like you are enabling all the
      counters attached to the thread. This is incorrect: with your
      implementation, the PMU can be shared between competing users. In
      particular, multiple tools may be monitoring the same thread. Now
      imagine a tool monitoring a self-monitoring thread which happens to
      start/stop its measurement using prctl(). That would also
      start/stop the measurement of the external tool. I have verified
      that this is what actually happens.
      I believe this call is bogus and should be eliminated. The
      interface exposes events individually, therefore they should be
      controlled individually.
  3/ Counter width
      It is not clear whether or not the API exposes counters as 64-bit
      wide on PMUs which do not implement 64-bit wide counters.
      Both irq_period and read() use 64-bit integers. However, it appears
      that the implementation is not using all the bits. In fact, on x86,
      it appears the irq_period is silently truncated. I believe this is
      not correct: if the period is not valid, an error should be
      returned. Otherwise, the tool will be getting samples at a rate
      different from what it requested.
      I would assume that on the read() side, counts are accumulated as
      64-bit integers. But if that is the case, then there is an
      asymmetry between periods and counts.
      Given that your API is high-level, I don't think tools should have
      to worry about the actual width of a counter. This is especially
      true because they don't know which counter the event is going to go
      into, and, if I recall correctly, on some PMU models different
      counters can have different widths (Power, I think).
      It is rather convenient for tools to always manipulate counters as
      64-bit integers. You should provide a consistent view of counts
      and periods.
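      For reference, presenting a narrow hardware counter as a 64-bit
      value is straightforward on the kernel side: accumulate deltas
      modulo the counter width, and reject periods that do not fit
      instead of truncating. A minimal sketch (not the actual PCL code):

```c
#include <stdint.h>

/* Accumulate a 'width'-bit hardware counter into a 64-bit software
 * count. 'prev' and 'now' are successive raw reads of the hardware
 * counter; the mask handles wraparound between reads. Sketch only. */
static uint64_t accumulate(uint64_t sw_count, uint64_t prev, uint64_t now,
                           unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;
    return sw_count + ((now - prev) & mask);
}

/* A period wider than the counter cannot be honored; return an error
 * indication instead of silently truncating. */
static int period_valid(uint64_t period, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;
    return period != 0 && period <= mask;
}
```

      With this scheme, tools always see 64-bit counts regardless of the
      physical counter width, and an out-of-range period fails loudly at
      open time rather than sampling at the wrong rate.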
  4/ Grouping
      By design, an event can only be part of one group at a time. Events
      in a group are guaranteed to be active on the PMU at the same time.
      That means a group cannot have more events than there are counters
      available on the PMU. Tools may want to know the number of counters
      available in order to group their events accordingly, such that
      reliable ratios can be computed. It seems the only way to know this
      is by trial and error. This is not practical.
  5/ Multiplexing and scaling
      The PMU can be shared by multiple programs, each controlling a
      variable number of events. Multiplexing occurs by default unless
      pinned is requested. The exclusive option only guarantees that the
      group does not share the PMU with other groups while it is active,
      at least that is my understanding.
      By default, you may be multiplexed, and if that happens you cannot
      know unless you request the timing information as part of the
      read_format. Without it, if multiplexing has occurred, bogus counts
      may be returned with no indication whatsoever.
      To avoid returning misleading information, it seems the API should
      refuse to open a non-pinned event which does not have
      PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING in
      its read_format. This would avoid a lot of confusion down the road.
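      With the timing information requested, a tool can at least scale
      the counts. A sketch of the standard extrapolation, assuming read()
      returned the raw value plus the two times the read_format names
      suggest:

```c
#include <stdint.h>

/* Extrapolate a multiplexed count to the full enabled window:
 *     scaled = raw * time_enabled / time_running
 * time_running < time_enabled means the event was multiplexed, so the
 * result is an estimate, not an exact count. Sketch only. */
static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0; /* never scheduled: no basis for extrapolation */
    /* double avoids overflow of raw * time_enabled for large counts */
    return (uint64_t)((double)raw * (double)time_enabled /
                      (double)time_running);
}
```

      Without time_enabled and time_running, none of this is possible,
      which is exactly why opening a non-pinned event without them is
      asking for trouble.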
  7/ Multiplexing and system-wide
      Multiplexing is time-based and hooked into the timer tick. At every
      tick, the kernel tries to schedule another group of events.
      In tickless kernels, if a CPU is idle, no timer tick is generated
      and therefore no multiplexing occurs. This is incorrect. Just
      because the CPU is idle does not mean there are no interesting PMU
      events to measure: parts of the CPU may still be active, e.g.,
      caches and buses. Thus, it is expected that multiplexing still
      happens.
      You need to hook the timer source for multiplexing to something
      else which is not affected by tickless operation.
  8/ Controlling group multiplexing
      Although multiplexing is somewhat exposed to users via the timing
      information, I believe there is not enough control. I know of
      advanced monitoring tools which need to measure over a dozen events
      in one monitoring session. Given that the underlying PMU does not
      have enough counters, OR that certain events cannot be measured
      together, it is necessary to split the events into groups and
      multiplex them. Events are not grouped at random, AND groups are
      not ordered at random either. The sequence of groups is carefully
      chosen such that related events are in neighboring groups and
      therefore measure similar parts of the execution. This way you can
      mitigate the fluctuations introduced by multiplexing and compare
      ratios. In other words, some tools may want to control the order in
      which groups are scheduled on the PMU.
      The exclusive flag ensures correct grouping, but there is nothing
      to control the ordering of groups. That is a problem for some
      tools. Groups from different 'sessions' may be interleaved and
      break the continuity of measurement.
      The group ordering has to be controllable from the tools OR fully
      specified by the API, but it should not be a property of the
      implementation. The API could, for instance, specify that groups
      are scheduled in increasing order of the group leaders' file
      descriptors. There needs to be some way of preventing interleaving
      of groups from different 'sessions'.
  9/ Event buffer
      There is a kernel-level event buffer which can be re-mapped
      read-only at the user level via mmap(). The buffer must be a
      multiple of the page size and at least 2 pages long. The first page
      is used for the counter re-mapping and the buffer header, the
      second for the actual event buffer.
      The buffer is managed as a cyclic buffer, which means there is a
      continuous race between the tool and the kernel: the tool must
      parse the buffer faster than the kernel can fill it. It is
      important to realize that the race continues even when monitoring
      is stopped, as non-PMU-based information, such as mmap and munmap
      records, keeps being stored. This is expected: mapping information
      must not be lost, otherwise samples may be correlated incorrectly.
      However, there is currently no reliable way of figuring out whether
      or not the buffer has wrapped around since the last scan by the
      tool. Just checking the current position or estimating the space
      left is not good enough. There ought to be an overflow counter of
      some sort indicating the number of times the head has wrapped
      around.
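      One possible realization of such an indicator (purely illustrative,
      not part of the proposal): publish data_head as a monotonically
      increasing 64-bit byte count that is never reduced modulo the
      buffer size. The reader then derives both the write position and a
      reliable overrun check:

```c
#include <stdint.h>

/* 'head' is a monotonically increasing byte count published by the
 * kernel; 'tail' is the reader's own monotonic position. Sketch only. */

/* Byte offset of the next write inside the cyclic buffer. */
static uint64_t head_offset(uint64_t head, uint64_t buf_size)
{
    return head % buf_size;
}

/* Data has been overwritten (the writer lapped the reader) exactly when
 * the unread distance exceeds the buffer capacity. */
static int reader_overrun(uint64_t head, uint64_t tail, uint64_t buf_size)
{
    return head - tail > buf_size;
}
```

      This gives the same information as an explicit wrap counter: the
      number of laps is (head - tail) / buf_size, and it cannot be fooled
      by the head happening to land near its previous position.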
  10/ Group event buffer entry
      This is activated by setting PERF_RECORD_GROUP in the record_type
      field. With this bit set, the values of the other members of the
      group are stored sequentially in the buffer. To help figure out
      which value corresponds to which event, the current implementation
      also stores the raw encoding of the event.
      The raw encoding does not help identify which event a value refers
      to: there can be multiple events with the same code. Nor does it
      fit the API model, where events are identified by file descriptors.
      The file descriptor should be provided, not the raw encoding.
  11/ reserve_percpu
      Many PMU models have more than just counters, and the counters are
      not symmetrical, even on x86.
      What does this API actually guarantee in terms of which events a
      tool will be able to measure with the reserved counters?
II/ X86 comments
  Mostly implementation-related comments in this section.
  1/ Fixed counters and events on Intel
      You cannot simply fall back to a generic counter if you cannot get
      a fixed counter. There are model-specific bugs: for instance,
      UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing
      on Nehalem when it is used in fixed counter 2 versus a generic
      counter. The same is true on Core.
      You cannot simply look at the event code field to determine whether
      an event is supported by a fixed counter. You must also look at the
      other fields, such as edge, invert, and cnt-mask: if those are set,
      you have to fall back to a generic counter, as fixed counters only
      support privilege-level filtering. As indicated above, though,
      programming UNHALTED_REFERENCE_CYCLES on a generic counter does not
      count the same thing, so you need to fail if filters other than
      privilege levels are present on this event.
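      To make the point concrete, here is a sketch of the kind of check
      required before placing an event on a fixed counter. The bit
      positions follow the architectural x86 PERFEVTSEL layout (event
      select bits 0-7, umask bits 8-15, edge bit 18, invert bit 23,
      counter mask bits 24-31); the placement policy itself is my reading
      of the constraints above, not the proposal's code:

```c
#include <stdint.h>

/* PERFEVTSELx fields (architectural x86 layout). */
#define SEL_EVENT(cfg)  ((cfg) & 0xffULL)
#define SEL_UMASK(cfg)  (((cfg) >> 8) & 0xffULL)
#define SEL_EDGE(cfg)   (((cfg) >> 18) & 0x1ULL)
#define SEL_INV(cfg)    (((cfg) >> 23) & 0x1ULL)
#define SEL_CMASK(cfg)  (((cfg) >> 24) & 0xffULL)

/* Fixed counters only support privilege-level filtering: an event with
 * edge, invert, or cnt-mask set cannot go there. Sketch only. */
static int fixed_counter_ok(uint64_t cfg)
{
    return SEL_EDGE(cfg) == 0 && SEL_INV(cfg) == 0 && SEL_CMASK(cfg) == 0;
}

/* Events which count differently on a generic counter, such as
 * UNHALTED_REFERENCE_CYCLES (event 0x3c, umask 0x01), must be rejected
 * rather than silently moved. Returns 0 for a fixed counter, 1 for a
 * generic counter, -1 for "cannot honor, fail the open". */
static int place_event(uint64_t cfg)
{
    int fixed_only = (SEL_EVENT(cfg) == 0x3c && SEL_UMASK(cfg) == 0x01);

    if (fixed_counter_ok(cfg))
        return 0;
    if (fixed_only)
        return -1; /* extra filters on a fixed-counter-only event */
    return 1;
}
```

      The key design point is the -1 case: silently falling back to a
      generic counter would return counts that measure something else.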
  2/ Event knowledge missing
      There are constraints and bugs on some events on Intel Core and
      Nehalem. In your model, those need to be handled by the kernel.
      Should the kernel make the wrong decision, there would be no
      work-around for user tools. Take the example I outlined just above
      with the Intel fixed counters. Constraints exist on AMD64
      processors as well.
  3/ Interrupt throttling
      There is apparently no way for a system administrator to set the
      threshold; it is hardcoded.
      Throttling occurs without the tool(s) knowing. I think this is a
      problem.
  4/ NMI
      Why restrict NMI to privileged users when you have throttling to
      protect against interrupt flooding?
      Are you trying to prevent non-privileged users from sampling inside
      kernel critical sections?
III/ Requests
  1/ Sampling period change
      As it stands today, there is no way to change a sampling period
      other than to close() the event file descriptor and start over.
      When you close the group leader, it is not clear to me what happens
      to the remaining events.
      I know of tools which want to adjust the sampling period based on
      the number of samples they get per second.
      By design, your perf_counter_open() is not meant to be called in
      the critical path, e.g., while you are processing samples from the
      event buffer. Thus, I think it would be good to have a dedicated
      call to change the period.
  2/ Sampling period randomization
      It is our experience (on Itanium, for instance) that for certain
      sampling measurements it is beneficial to randomize the sampling
      period a bit. This is in particular the case when sampling on an
      event that occurs very frequently and is not related to timing,
      e.g., branch_instructions_retired. Randomization helps mitigate the
      bias. You do not need anything sophisticated, but when you are
      using a kernel-level sampling buffer, the kernel needs to do the
      randomizing. Randomization needs to be supported per event.
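      Nothing sophisticated is needed. A sketch of what per-event
      kernel-side randomization could look like (the rand_bits attribute
      and the xorshift generator are illustrative only): add a small
      pseudo-random jitter centered on the base period.

```c
#include <stdint.h>

/* Tiny xorshift PRNG: good enough for de-biasing, no crypto needed. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Jitter the period by roughly +/- 2^(rand_bits-1), keeping the average
 * close to the base period. 'rand_bits' would be a per-event attribute
 * in this sketch. */
static uint64_t randomize_period(uint64_t base, unsigned rand_bits,
                                 uint64_t *state)
{
    uint64_t mask = (1ULL << rand_bits) - 1;
    uint64_t jitter = xorshift64(state) & mask;
    uint64_t p = base + jitter - (mask >> 1);

    return p ? p : 1; /* never hand the hardware a zero period */
}
```

      The PMU interrupt handler would call something like this when
      reloading the counter, so each sample lands at a slightly different
      distance from the last one.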
  3/ Group multiplexing ordering
      As mentioned above, the ordering of group multiplexing for one
      process needs to be either specified by the API or controllable by
      users.
IV/ Open questions
  1/ Support for model-specific uncore PMU monitoring capabilities
      Recent processors have multiple PMUs: typically one per core, but
      also one at the socket level, e.g., on Intel Nehalem. It is
      expected that this API will provide access to these PMUs as well.
      It seems that with the current API, raw events for those PMUs would
      need a new architecture-specific type, as the event encoding by
      itself may not be enough to disambiguate between a core and an
      uncore PMU event.
      How are those events going to be supported?
  2/ Features impacting all counters
      On some PMU models, e.g., Itanium, there are certain features which
      influence all active counters. For instance, there is a way to
      restrict monitoring to a contiguous range of code or data addresses
      using both PMU registers and the debug registers.
      Given that the API exposes events (counters) as independent of each
      other, I wonder how range restriction could be implemented.
      Similarly, on Itanium there are global behaviors: for instance, on
      counter overflow the entire PMU freezes all at once. That seems
      contradictory with the design of the API, which creates the
      illusion of independence.
      What solutions do you propose?
  3/ AMD IBS
      How is AMD IBS going to be implemented?
      IBS has two separate sets of registers: one to capture
      fetch-related data and another to capture instruction execution
      data. For each, there is one config register but multiple data
      registers. In each mode there is a specific sampling period, and
      IBS can interrupt.
      It looks like you could define two pseudo-events or event types and
      then define a new record_format and read_format that would only be
      valid for an IBS event.
      Is that how you intend to support IBS?
  4/ Intel PEBS
      Since the Netburst-based processors, Intel PMUs have supported a
      hardware sampling buffer mechanism called PEBS. PEBS only really
      became useful with Nehalem.
      Not all events support PEBS, and up until Nehalem only one counter
      supported PEBS (PMC0). The format of the hardware buffer changed
      between Core and Nehalem. It is not yet architected, thus it can
      still evolve with future PMU models.
      On Nehalem, there is a new PEBS-based feature called Load Latency
      Filtering which captures where data cache misses occur (similar to
      Itanium D-EAR). Activating this feature requires setting a latency
      threshold hosted in a separate PMU MSR.
      On Nehalem, given that all 4 generic counters support PEBS, the
      sampling buffer may contain samples generated by any of the 4
      counters. The buffer includes a bitmask of registers to determine
      the source of the samples; multiple bits may be set in the bitmask.
      How will PEBS be supported in this new API?
  5/ Intel Last Branch Record (LBR)
      Intel processors since Netburst have had a cyclic buffer, hosted in
      registers, which can record taken branches. Each taken branch is
      stored in a pair of LBR registers (source, destination). Up until
      Nehalem, there were no filtering capabilities for LBR. LBR is not
      an architected PMU feature.
      There is no counter associated with LBR. Nehalem has an LBR_SELECT
      MSR; however, there are some constraints on it, given that it is
      shared between hardware threads.
      LBR is only useful when sampling and therefore must be combined
      with a counter. LBR must also be configured to freeze on PMU
      interrupt.
      How is LBR going to be supported?
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/