comments on Performance Counters for Linux (PCL)
From: stephane eranian
Date: Thu May 28 2009 - 10:58:57 EST
The following sections contain some preliminary comments concerning the
Performance Counters for Linux (PCL) API and implementation proposal
currently in development.
I/ General API comments
  1/ Data structures
    * struct perf_counter_hw_event
    - I think this structure will also be used to enable non-counting
      features, e.g., IBS. The name is awkward: it is already used to
      enable non-hardware (SW) events. Why not call it struct perf_event?
    - uint64_t config
      Why use a single field to encode both the event type and its code?
      By design, the syscall is not in the critical path. Why not spell
      things out clearly: int type, uint64_t code.
    - uint64_t irq_period
      IRQ is an x86-related name. Why not use smpl_period instead?
    - uint32_t record_type
      This field is a bitmask. I believe 32 bits is too small to
      accommodate future record formats.
    - uint32_t read_format
      Ditto.
    - uint64_t nmi
      This is an x86-only feature. Why make it visible in a generic API?
      What are its semantics? Can I have one counter use NMI and another
      not, or are you planning on switching the interrupt vector when you
      change event groups?
      Why do I need to be a privileged user to enable NMI? Especially
      given that:
        - non-privileged users can monitor at privilege level 0 (kernel).
        - there is interrupt throttling.
    - uint64_t exclude_*
      It seems these fields were added to support the generic HW events,
      but I find them confusing and their semantics unclear. Furthermore,
      aren't they irrelevant for SW events?
      What is the meaning of exclude_user? Which privilege levels are
      actually excluded? Take Itanium: it has 4 privilege levels and the
      PMU counters can monitor at any privilege level or combination
      thereof.
      When programming raw HW events, the privilege-level filtering is
      typically already included in the encoding. Which setting has
      priority, the raw encoding or the exclude_* fields? Looking at the
      existing x86 implementation, it seems exclude_* can override
      whatever is set in the raw event code.
      For any event, but for SW events in particular, why not encode this
      in the config field, as is done for a raw HW event?
    - mmap, munmap, comm
      It is not clear to me why these fields are defined here rather than
      as PERF_RECORD_* values. They are stored in the event buffer only,
      and they are only useful when sampling.
      It is also not clear why mmap and munmap are separate options.
      What is the point of munmap-only notification?
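      To illustrate the type/code split suggested above, here is a sketch
      (all names hypothetical, not part of the proposal): the two fields
      are spelled out, and a trivial helper shows that a packed word can
      still be derived whenever a compact internal encoding is wanted.

```c
#include <stdint.h>

/* Hypothetical layout: spell the event out instead of packing it into a
 * single 64-bit config word. All names are illustrative only. */
struct perf_event_attr_sketch {
    int      type;        /* e.g., raw HW, generic HW, SW, tracepoint */
    uint64_t code;        /* event encoding within that type */
    uint64_t smpl_period; /* arch-neutral name instead of irq_period */
};

/* If a packed word is ever needed internally, the split costs nothing:
 * pack on entry, once, outside any critical path. */
static inline uint64_t pack_config(int type, uint64_t code)
{
    return ((uint64_t)type << 56) | (code & ((1ULL << 56) - 1));
}

static inline int config_type(uint64_t config)
{
    return (int)(config >> 56);
}

static inline uint64_t config_code(uint64_t config)
{
    return config & ((1ULL << 56) - 1);
}
```

      Since the open syscall is off the critical path, the cost of the
      extra field is irrelevant; clarity wins.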
    * enum perf_event_types vs. enum perf_event_type
      The two names are too close to each other, yet they define
      unrelated data structures. This is very confusing.
    * struct perf_counter_mmap_page
      The definition of data_head precludes sampling buffers bigger than
      4GB. Does that make sense on TB machines?
      Given that there is only one counter per page, an awful lot of
      precious RLIMIT_MEMLOCK space is wasted on this. Typically, if you
      are self-sampling, you are not going to read the current value of
      the sampling period: that re-mapping trick is only useful when
      counting.
      Why not make these two separate mappings (using the mmap offset as
      the indicator)? With this approach, you would get one page back per
      sampling buffer, and that page could then be used for the actual
      samples.
  2/ System calls
    * ioctl()
      You have defined 3 ioctls so far to operate on an existing event.
      I was under the impression that ioctl() should not be used except
      for drivers.
    * prctl()
      The API is event-based: each event gets a file descriptor, so
      events are managed individually. Thus, to enable/disable, you need
      to enable/disable each one separately.
      The use of prctl() breaks this design choice. It is not clear what
      you are actually enabling. It looks like you are enabling all the
      counters attached to the thread. This is incorrect: with your
      implementation, the PMU can be shared between competing users. In
      particular, multiple tools may be monitoring the same thread. Now
      imagine a tool monitoring a self-monitoring thread which happens to
      start/stop its measurement using prctl(). That would also
      start/stop the measurement of the external tool. I have verified
      that this is what actually happens.
      I believe this call is bogus and should be eliminated. The
      interface exposes events individually, therefore they should be
      controlled individually.
  3/ Counter width
      It is not clear whether or not the API exposes counters as 64-bit
      wide on PMUs which do not implement 64-bit wide counters.
      Both irq_period and read() use 64-bit integers. However, it appears
      that the implementation is not using all the bits. In fact, on x86,
      it appears the irq_period is silently truncated. I believe this is
      not correct: if the period is not valid, an error should be
      returned. Otherwise, the tool will be getting samples at a rate
      different from what it requested.
      I would assume that on the read() side, counts are accumulated as
      64-bit integers. But if that is the case, then there is an
      asymmetry between periods and counts.
      Given that your API is high-level, I don't think tools should have
      to worry about the actual width of a counter. This is especially
      true because they don't know which counter the event is going to go
      into, and, if I recall correctly, on some PMU models different
      counters can have different widths (Power, I think).
      It is rather convenient for tools to always manipulate counters as
      64-bit integers. You should provide a consistent view of counts
      and periods.
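      For reference, presenting a narrow hardware counter as a 64-bit
      value is straightforward on the kernel side: accumulate deltas
      modulo the counter width, and reject periods that do not fit
      instead of truncating. A minimal sketch (not the actual PCL code):

```c
#include <stdint.h>

/* Accumulate a 'width'-bit hardware counter into a 64-bit software
 * count. 'prev' and 'now' are successive raw reads of the hardware
 * counter; the mask handles wraparound between reads. Sketch only. */
static uint64_t accumulate(uint64_t sw_count, uint64_t prev, uint64_t now,
                           unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;
    return sw_count + ((now - prev) & mask);
}

/* A period wider than the counter cannot be honored; return an error
 * indication instead of silently truncating. */
static int period_valid(uint64_t period, unsigned width)
{
    uint64_t mask = (width >= 64) ? ~0ULL : (1ULL << width) - 1;
    return period != 0 && period <= mask;
}
```

      With this scheme, tools always see 64-bit counts regardless of the
      physical counter width, and an out-of-range period fails loudly at
      open time rather than sampling at the wrong rate.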
  4/ Grouping
      By design, an event can only be part of one group at a time. Events
      in a group are guaranteed to be active on the PMU at the same time.
      That means a group cannot have more events than there are counters
      available on the PMU. Tools may want to know the number of counters
      available in order to group their events accordingly, such that
      reliable ratios can be computed. It seems the only way to know this
      is by trial and error. This is not practical.
  5/ Multiplexing and scaling
      The PMU can be shared by multiple programs, each controlling a
      variable number of events. Multiplexing occurs by default unless
      pinned is requested. The exclusive option only guarantees that the
      group does not share the PMU with other groups while it is active,
      at least that is my understanding.
      By default, you may be multiplexed, and if that happens you cannot
      know unless you request the timing information as part of the
      read_format. Without it, if multiplexing has occurred, bogus counts
      may be returned with no indication whatsoever.
      To avoid returning misleading information, it seems the API should
      refuse to open a non-pinned event which does not have
      PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING in
      its read_format. This would avoid a lot of confusion down the road.
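      With the timing information requested, a tool can at least scale
      the counts. A sketch of the standard extrapolation, assuming read()
      returned the raw value plus the two times the read_format names
      suggest:

```c
#include <stdint.h>

/* Extrapolate a multiplexed count to the full enabled window:
 *     scaled = raw * time_enabled / time_running
 * time_running < time_enabled means the event was multiplexed, so the
 * result is an estimate, not an exact count. Sketch only. */
static uint64_t scale_count(uint64_t raw, uint64_t time_enabled,
                            uint64_t time_running)
{
    if (time_running == 0)
        return 0; /* never scheduled: no basis for extrapolation */
    /* double avoids overflow of raw * time_enabled for large counts */
    return (uint64_t)((double)raw * (double)time_enabled /
                      (double)time_running);
}
```

      Without time_enabled and time_running, none of this is possible,
      which is exactly why opening a non-pinned event without them is
      asking for trouble.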
  7/ Multiplexing and system-wide
      Multiplexing is time-based and hooked into the timer tick. At every
      tick, the kernel tries to schedule another group of events.
      In tickless kernels, if a CPU is idle, no timer tick is generated
      and therefore no multiplexing occurs. This is incorrect. Just
      because the CPU is idle does not mean there are no interesting PMU
      events to measure: parts of the CPU may still be active, e.g.,
      caches and buses. Thus, it is expected that multiplexing still
      happens.
      You need to hook the timer source for multiplexing to something
      else which is not affected by tickless operation.
  8/ Controlling group multiplexing
      Although multiplexing is somewhat exposed to users via the timing
      information, I believe there is not enough control. I know of
      advanced monitoring tools which need to measure over a dozen events
      in one monitoring session. Given that the underlying PMU does not
      have enough counters, OR that certain events cannot be measured
      together, it is necessary to split the events into groups and
      multiplex them. Events are not grouped at random, AND groups are
      not ordered at random either. The sequence of groups is carefully
      chosen such that related events are in neighboring groups and
      therefore measure similar parts of the execution. This way you can
      mitigate the fluctuations introduced by multiplexing and compare
      ratios. In other words, some tools may want to control the order in
      which groups are scheduled on the PMU.
      The exclusive flag ensures correct grouping, but there is nothing
      to control the ordering of groups. That is a problem for some
      tools. Groups from different 'sessions' may be interleaved and
      break the continuity of measurement.
      The group ordering has to be controllable from the tools OR fully
      specified by the API, but it should not be a property of the
      implementation. The API could, for instance, specify that groups
      are scheduled in increasing order of the group leaders' file
      descriptors. There needs to be some way of preventing interleaving
      of groups from different 'sessions'.
  9/ Event buffer
      There is a kernel-level event buffer which can be re-mapped
      read-only at the user level via mmap(). The buffer must be a
      multiple of the page size and at least 2 pages long. The first page
      is used for the counter re-mapping and the buffer header, the
      second for the actual event buffer.
      The buffer is managed as a cyclic buffer, which means there is a
      continuous race between the tool and the kernel: the tool must
      parse the buffer faster than the kernel can fill it. It is
      important to realize that the race continues even when monitoring
      is stopped, as non-PMU-based information, such as mmap and munmap
      records, keeps being stored. This is expected: mapping information
      must not be lost, otherwise samples may be correlated incorrectly.
      However, there is currently no reliable way of figuring out whether
      or not the buffer has wrapped around since the last scan by the
      tool. Just checking the current position or estimating the space
      left is not good enough. There ought to be an overflow counter of
      some sort indicating the number of times the head has wrapped
      around.
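      One possible realization of such an indicator (purely illustrative,
      not part of the proposal): publish data_head as a monotonically
      increasing 64-bit byte count that is never reduced modulo the
      buffer size. The reader then derives both the write position and a
      reliable overrun check:

```c
#include <stdint.h>

/* 'head' is a monotonically increasing byte count published by the
 * kernel; 'tail' is the reader's own monotonic position. Sketch only. */

/* Byte offset of the next write inside the cyclic buffer. */
static uint64_t head_offset(uint64_t head, uint64_t buf_size)
{
    return head % buf_size;
}

/* Data has been overwritten (the writer lapped the reader) exactly when
 * the unread distance exceeds the buffer capacity. */
static int reader_overrun(uint64_t head, uint64_t tail, uint64_t buf_size)
{
    return head - tail > buf_size;
}
```

      This gives the same information as an explicit wrap counter: the
      number of laps is (head - tail) / buf_size, and it cannot be fooled
      by the head happening to land near its previous position.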
  10/ Group event buffer entry
      This is activated by setting PERF_RECORD_GROUP in the record_type
      field. With this bit set, the values of the other members of the
      group are stored sequentially in the buffer. To help figure out
      which value corresponds to which event, the current implementation
      also stores the raw encoding of the event.
      The raw encoding does not help identify which event a value refers
      to: there can be multiple events with the same code. Nor does it
      fit the API model, where events are identified by file descriptors.
      The file descriptor should be provided, not the raw encoding.
  11/ reserve_percpu
      Many PMU models have more than just counters, and the counters are
      not symmetrical, even on x86.
      What does this API actually guarantee in terms of which events a
      tool will be able to measure with the reserved counters?
II/ X86 comments
  Mostly implementation-related comments in this section.
  1/ Fixed counters and events on Intel
      You cannot simply fall back to a generic counter if you cannot get
      a fixed counter. There are model-specific bugs: for instance,
      UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing
      on Nehalem when it is used in fixed counter 2 versus a generic
      counter. The same is true on Core.
      You cannot simply look at the event code field to determine whether
      an event is supported by a fixed counter. You must also look at the
      other fields, such as edge, invert, and cnt-mask: if those are set,
      you have to fall back to a generic counter, as fixed counters only
      support privilege-level filtering. As indicated above, though,
      programming UNHALTED_REFERENCE_CYCLES on a generic counter does not
      count the same thing, so you need to fail if filters other than
      privilege levels are present on this event.
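      To make the point concrete, here is a sketch of the kind of check
      required before placing an event on a fixed counter. The bit
      positions follow the architectural x86 PERFEVTSEL layout (event
      select bits 0-7, umask bits 8-15, edge bit 18, invert bit 23,
      counter mask bits 24-31); the placement policy itself is my reading
      of the constraints above, not the proposal's code:

```c
#include <stdint.h>

/* PERFEVTSELx fields (architectural x86 layout). */
#define SEL_EVENT(cfg)  ((cfg) & 0xffULL)
#define SEL_UMASK(cfg)  (((cfg) >> 8) & 0xffULL)
#define SEL_EDGE(cfg)   (((cfg) >> 18) & 0x1ULL)
#define SEL_INV(cfg)    (((cfg) >> 23) & 0x1ULL)
#define SEL_CMASK(cfg)  (((cfg) >> 24) & 0xffULL)

/* Fixed counters only support privilege-level filtering: an event with
 * edge, invert, or cnt-mask set cannot go there. Sketch only. */
static int fixed_counter_ok(uint64_t cfg)
{
    return SEL_EDGE(cfg) == 0 && SEL_INV(cfg) == 0 && SEL_CMASK(cfg) == 0;
}

/* Events which count differently on a generic counter, such as
 * UNHALTED_REFERENCE_CYCLES (event 0x3c, umask 0x01), must be rejected
 * rather than silently moved. Returns 0 for a fixed counter, 1 for a
 * generic counter, -1 for "cannot honor, fail the open". */
static int place_event(uint64_t cfg)
{
    int fixed_only = (SEL_EVENT(cfg) == 0x3c && SEL_UMASK(cfg) == 0x01);

    if (fixed_counter_ok(cfg))
        return 0;
    if (fixed_only)
        return -1; /* extra filters on a fixed-counter-only event */
    return 1;
}
```

      The key design point is the -1 case: silently falling back to a
      generic counter would return counts that measure something else.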
  2/ Event knowledge missing
      There are constraints and bugs on some events on Intel Core and
      Nehalem. In your model, those need to be handled by the kernel.
      Should the kernel make the wrong decision, there would be no
      work-around for user tools. Take the example I outlined just above
      with the Intel fixed counters. Constraints exist on AMD64
      processors as well.
  3/ Interrupt throttling
      There is apparently no way for a system administrator to set the
      threshold; it is hardcoded.
      Throttling occurs without the tool(s) knowing. I think this is a
      problem.
  4/ NMI
      Why restrict NMI to privileged users when you have throttling to
      protect against interrupt flooding?
      Are you trying to prevent non-privileged users from sampling inside
      kernel critical sections?
III/ Requests
  1/ Sampling period change
      As it stands today, there is no way to change a sampling period
      other than to close() the event file descriptor and start over.
      When you close the group leader, it is not clear to me what happens
      to the remaining events.
      I know of tools which want to adjust the sampling period based on
      the number of samples they get per second.
      By design, your perf_counter_open() is not meant to be called in
      the critical path, e.g., while you are processing samples from the
      event buffer. Thus, I think it would be good to have a dedicated
      call to change the period.
  2/ Sampling period randomization
      It is our experience (on Itanium, for instance) that for certain
      sampling measurements it is beneficial to randomize the sampling
      period a bit. This is in particular the case when sampling on an
      event that occurs very frequently and is not related to timing,
      e.g., branch_instructions_retired. Randomization helps mitigate the
      bias. You do not need anything sophisticated, but when you are
      using a kernel-level sampling buffer, the kernel needs to do the
      randomizing. Randomization needs to be supported per event.
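      Nothing sophisticated is needed. A sketch of what per-event
      kernel-side randomization could look like (the rand_bits attribute
      and the xorshift generator are illustrative only): add a small
      pseudo-random jitter centered on the base period.

```c
#include <stdint.h>

/* Tiny xorshift PRNG: good enough for de-biasing, no crypto needed. */
static uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
}

/* Jitter the period by roughly +/- 2^(rand_bits-1), keeping the average
 * close to the base period. 'rand_bits' would be a per-event attribute
 * in this sketch. */
static uint64_t randomize_period(uint64_t base, unsigned rand_bits,
                                 uint64_t *state)
{
    uint64_t mask = (1ULL << rand_bits) - 1;
    uint64_t jitter = xorshift64(state) & mask;
    uint64_t p = base + jitter - (mask >> 1);

    return p ? p : 1; /* never hand the hardware a zero period */
}
```

      The PMU interrupt handler would call something like this when
      reloading the counter, so each sample lands at a slightly different
      distance from the last one.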
  3/ Group multiplexing ordering
      As mentioned above, the ordering of group multiplexing for one
      process needs to be either specified by the API or controllable by
      users.
IV/ Open questions
  1/ Support for model-specific uncore PMU monitoring capabilities
      Recent processors have multiple PMUs: typically one per core, but
      also one at the socket level, e.g., on Intel Nehalem. It is
      expected that this API will provide access to these PMUs as well.
      It seems that with the current API, raw events for those PMUs would
      need a new architecture-specific type, as the event encoding by
      itself may not be enough to disambiguate between a core and an
      uncore PMU event.
      How are those events going to be supported?
  2/ Features impacting all counters
      On some PMU models, e.g., Itanium, there are certain features which
      influence all active counters. For instance, there is a way to
      restrict monitoring to a contiguous range of code or data addresses
      using both PMU registers and the debug registers.
      Given that the API exposes events (counters) as independent of each
      other, I wonder how range restriction could be implemented.
      Similarly, on Itanium there are global behaviors: for instance, on
      counter overflow the entire PMU freezes all at once. That seems
      contradictory with the design of the API, which creates the
      illusion of independence.
      What solutions do you propose?
  3/ AMD IBS
      How is AMD IBS going to be implemented?
      IBS has two separate sets of registers: one to capture
      fetch-related data and another to capture instruction execution
      data. For each, there is one config register but multiple data
      registers. In each mode there is a specific sampling period, and
      IBS can interrupt.
      It looks like you could define two pseudo-events or event types and
      then define a new record_format and read_format that would only be
      valid for an IBS event.
      Is that how you intend to support IBS?
  4/ Intel PEBS
      Since the Netburst-based processors, Intel PMUs have supported a
      hardware sampling buffer mechanism called PEBS. PEBS only really
      became useful with Nehalem.
      Not all events support PEBS, and up until Nehalem only one counter
      supported PEBS (PMC0). The format of the hardware buffer changed
      between Core and Nehalem. It is not yet architected, thus it can
      still evolve with future PMU models.
      On Nehalem, there is a new PEBS-based feature called Load Latency
      Filtering which captures where data cache misses occur (similar to
      Itanium D-EAR). Activating this feature requires setting a latency
      threshold hosted in a separate PMU MSR.
      On Nehalem, given that all 4 generic counters support PEBS, the
      sampling buffer may contain samples generated by any of the 4
      counters. The buffer includes a bitmask of registers to determine
      the source of the samples; multiple bits may be set in the bitmask.
      How will PEBS be supported in this new API?
  5/ Intel Last Branch Record (LBR)
      Intel processors since Netburst have had a cyclic buffer, hosted in
      registers, which can record taken branches. Each taken branch is
      stored in a pair of LBR registers (source, destination). Up until
      Nehalem, there were no filtering capabilities for LBR. LBR is not
      an architected PMU feature.
      There is no counter associated with LBR. Nehalem has an LBR_SELECT
      MSR; however, there are some constraints on it, given that it is
      shared between hardware threads.
      LBR is only useful when sampling and therefore must be combined
      with a counter. LBR must also be configured to freeze on PMU
      interrupt.
      How is LBR going to be supported?
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/