Re: comments on Performance Counters for Linux (PCL)

From: stephane eranian
Date: Fri May 29 2009 - 06:43:40 EST


On Thu, May 28, 2009 at 6:25 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>> I/ General API comments
>>   1/ Data structures
>>      * struct perf_counter_hw_event
>>      - I think this structure will be used to enable non-counting features,
>>        e.g. IBS. Name is awkward. It is used to enable non-hardware events
>>        (sw events). Why not call it: struct perf_event
> Sure a name change might be a good idea.
In that case, the open syscall should also be changed to something
more generic: perf_open()

>>      - uint64_t irq_period
>>        IRQ is an x86 related name. Why not use smpl_period instead?
> don't really care, but IRQ seems used throughout linux, we could name
> the thing interrupt or sample period.
IRQ is well understood by kernel people and I agree it is not X86 specific.
But we are talking about user level developers, some not really software
engineers either, e.g., physicists.

>>      - uint32_t record_type
>>        This field is a bitmask. I believe 32-bit is too small to accommodate
>>        future record formats.
> It currently controls 8 aspects of the overflow entry, do you really
> foresee the need for more than 32?
Again, given the perf_open() syscall is not on a critical path, it does not
hurt to pass a bigger struct and have provision for future extensions.

>>      - uint32_t read_format
>>        Ditto.
> I guess it doesn't hurt extending them..


>>      - uint64_t exclude_*
>>        What is the meaning of exclude_user? Which priv levels are actually
>>        excluded?
> userspace
This is a fuzzy notion.

>>        Take Itanium, it has 4 priv levels and the PMU counters can monitor at
>>        any priv levels or combination thereof?
> x86 has more priv rings too, but linux only uses 2, kernel (ring 0) and
> user (ring 2 iirc). Does ia64 expose more than 2 priv levels in linux?
X86 has priv levels 0, 1, 2, 3. The issue, though, is that the X86 PMU only
distinguishes 2 coarse levels: OS and USR, where OS=0 and USR=1,2,3.

IA64 also has 4 levels, but the difference is that the PMU can filter on
all 4 levels independently. The question is then, what does exclude_user
actually encompass there?
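
To make the ambiguity concrete, here is a minimal sketch (an illustration,
not PCL code). It assumes a hypothetical 4-bit priv level mask, one bit per
level, and collapses it onto the coarse X86 view. The struct, the macros and
the function name are made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-level mask: bit n set means "count at priv level n". */
#define PLM0 (1u << 0)
#define PLM1 (1u << 1)
#define PLM2 (1u << 2)
#define PLM3 (1u << 3)

struct excl_flags {
    bool exclude_kernel;
    bool exclude_user;
    bool lossy; /* the 2-level view cannot express the mask exactly */
};

/* Collapse a 4-level mask onto the coarse X86 view: OS = 0, USR = 1,2,3. */
static struct excl_flags plm_to_exclude(uint32_t plm)
{
    const uint32_t usr = PLM1 | PLM2 | PLM3;
    struct excl_flags f;

    assert((plm & ~0xfu) == 0); /* only 4 levels exist */
    f.exclude_kernel = !(plm & PLM0);
    f.exclude_user = !(plm & usr);
    /* Lossy when only some of levels 1-3 are requested. */
    f.lossy = (plm & usr) != 0 && (plm & usr) != usr;
    return f;
}
```

On a PMU that filters each level independently, a request such as
PLM0 | PLM3 comes out as lossy: exclude_user alone cannot say which of
levels 1, 2 and 3 are meant.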

And then, there is VT on X86 and IA64...
AMD64 PMU as of family 10h has host and guest filters in the
PERFEVTSEL registers.

>>        When programming raw HW events, the priv level filtering is typically
>>        already included. Which setting has priority, raw encoding or the
>>        exclude_*?
>>        Looking at the existing X86 implementation, it seems exclude_* can
>>        override whatever is set in the raw event code.
>>        For any events, but in particular, SW events, why not encode this in
>>        the config field, like it is for a raw HW event?
> Because config is about what we count, this is about where we count.
> Doesn't seem strange to separate these two.
For a monitor tool, this means it may need to do the work twice.

Imagine, I encode events using strings: INST_RETIRED:u=1:k=1.
This means measure INST_RETIRED, at user level and kernel level.
You typically pass this to a helper library and you get back the raw
event code, which includes the priv level mask. If that library is generic
and does not know about PCL, then the tool needs to extract the priv level
information, either from the raw code or from the string, and set the
exclude_* fields accordingly. The alternative is to make the library
PCL-aware and have it set the perf_event structure directly.
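
Here is a minimal sketch of that extra step on X86, assuming the raw code
follows the PERFEVTSEL layout (USR is bit 16, OS is bit 17). The struct and
function names are hypothetical, and whether the priv bits should be
stripped from config is my assumption, not something PCL specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* USR and OS filter bits of an X86 PERFEVTSEL-style raw event code. */
#define EVTSEL_USR (1ull << 16)
#define EVTSEL_OS  (1ull << 17)

/* Stand-in for the exclude_* bits of the event structure (hypothetical). */
struct priv_filter {
    bool exclude_user;
    bool exclude_kernel;
};

/*
 * Derive the exclude_* settings from a raw code returned by a generic
 * helper library, and strip the priv bits so they are stated only once.
 */
static struct priv_filter split_raw_code(uint64_t raw, uint64_t *config)
{
    struct priv_filter f;

    assert(config != NULL);
    f.exclude_user = !(raw & EVTSEL_USR);
    f.exclude_kernel = !(raw & EVTSEL_OS);
    *config = raw & ~(EVTSEL_USR | EVTSEL_OS);
    return f;
}
```

For INST_RETIRED:u=1:k=1 the library would hand back something like
0x0300c0, and the tool ends up decoding bits it just asked the library to
encode.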

>>      * struct perf_counter_mmap_page
>>        Given there is only one counter per-page, there is an awful lot of
>>        precious RLIMIT_MEMLOCK space wasted for this.
>>        Typically, if you are self-sampling, you are not going to read the
>>        current value of the sampling period. That re-mapping trick is only
>>        useful when counting.
>>        Why not make these two separate mappings (using the mmap offset as
>>        the indicator)?
>>        With this approach, you would get one page back per sampling period
>>        and that page could then be used for the actual samples.
> Not quite, you still need a place for the data_head.
You could put it at the beginning of the actual buffer. But then, I
suspect it will break the logic you have in data_head (explained below).

>>   2/ System calls
>>      * ioctl()
>>        You have defined 3 ioctls() so far to operate on an existing event.
>>        I was under the impression that ioctl() should not be used except for
>>        drivers.
> 4 actually.
Why not use 4 new syscalls instead of using ioctl()?

>>      * prctl()
>>        The API is event-based. Each event gets a file descriptor. Events are
>>        therefore managed individually. Thus, to enable/disable, you need to
>>        enable/disable each one separately.
>>        The use of prctl() breaks this design choice. It is not clear what you
>>        are actually enabling. It looks like you are enabling all the counters
>>        attached to the thread. This is incorrect. With your implementation,
>>        the PMU can be shared between competing users. In particular, multiple
>>        tools may be monitoring the same thread. Now, imagine, a tool is
>>        monitoring a self-monitoring thread which happens to start/stop its
>>        measurement using prctl(). Then, that would also start/stop the
>>        measurement of the external tool. I have verified that this is what is
>>        actually happening.
> Recently changed that, it enables/disables all counters created by the
> task calling prctl().
And attached to what? Itself or anything?

>>        I believe this call is bogus and it should be eliminated. The interface
>>        is exposing events individually therefore they should be controlled
>>        individually.
> Bogus maybe not, superfluous, yeah, it's a simpler way than iterating all
> the fds you just created, saves a few syscalls.
Well, my point was that it does not fit well with your file descriptor oriented
design.

>>   3/ Counter width
>>        It is not clear whether or not the API exposes counters as 64-bit wide
>>        on PMUs which do not implement 64-bit wide counters.
>>        Both irq_period and read() return 64-bit integers. However, it appears
>>        that the implementation is not using all the bits. In fact, on X86, it
>>        appears the irq_period is truncated silently. I believe this is not
>>        correct. If the period is not valid, an error should be returned.
>>        Otherwise, the tool will be getting samples at a rate different than
>>        what it requested.
> Sure, fail creation when the specified period is larger than the
> supported counter width -- patch welcome.
Yes, but then that means tools need to know on which counter
the event is going to be programmed. Not all counters may have the
same width.

>>        I would assume that on the read() side, counts are accumulated as
>>        64-bit integers. But if it is the case, then it seems there is an
>>        asymmetry between period and counts.
>>        Given that your API is high level, I don't think tools should have to
>>        worry about the actual width of a counter. This is especially true
>>        because they don't know which counters the event is going to go into
>>        and if I recall correctly, on some PMU models, different counters can
>>        have different width (Power, I think).
>>        It is rather convenient for tools to always manipulate counters as
>>        64-bit integers. You should provide a consistent view between counts
>>        and periods.
> So you're suggesting to artificially stretch periods by say composing a
> single overflow from smaller ones, ignoring the intermediate overflow
> events?
Yes, you emulate actual 64-bit wide counters. In the case of perfmon,
there is no notion of sampling period. All counters are exposed as 64-bit
wide. You can write any value you want into a counter. If you want a period p,
then you program the counter to -p. The period p may be larger than the width
of the actual counter. That means you will get intermediate overflows. A final
overflow will make the 64-bit value wrap around and that's when you
record a sample.
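
A minimal sketch of that emulation, to show the bookkeeping. This is not the
perfmon code; the struct and function names are made up, and the hardware
counter is modeled as a plain value that wraps at 2^width.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 64-bit virtual counter on top of a narrower hardware counter. */
struct vcounter {
    uint64_t soft;  /* upper bits, kept in software */
    unsigned width; /* hardware counter width in bits, 1..63 */
};

/*
 * Program a period p: the virtual counter is set to -p. Returns the
 * value to load into the hardware counter (the low 'width' bits).
 */
static uint64_t vcounter_set_period(struct vcounter *c, uint64_t period)
{
    uint64_t v, hw_mask;

    assert(c->width >= 1 && c->width <= 63 && period > 0);
    v = (uint64_t)0 - period;
    hw_mask = (1ull << c->width) - 1;
    c->soft = v & ~hw_mask;
    return v & hw_mask;
}

/*
 * Call on each hardware overflow interrupt. Returns true only when the
 * full 64-bit value wraps, i.e. when a sample must be recorded.
 */
static bool vcounter_hw_overflow(struct vcounter *c)
{
    c->soft += 1ull << c->width; /* carry out of the hardware bits */
    return c->soft == 0;
}
```

With a 4-bit counter and p = 40, the hardware is loaded with 8, and the
first two overflows (after 8 and 24 events) are intermediate. The third one,
after 40 events, wraps the 64-bit value and triggers the sample.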

>>   4/ Grouping
>>        By design, an event can only be part of one group at a time. Events in
>>        a group are guaranteed to be active on the PMU at the same time. That
>>        means a group cannot have more events than there are available counters
>>        on the PMU. Tools may want to know the number of counters available in
>>        order to group their events accordingly, such that reliable ratios
>>        could be computed. It seems the only way to know this is by trial and
>>        error. This is not practical.
> Got a proposal to amend this?
Either add a syscall for that, or better, expose this via sysfs.

>>   5/ Multiplexing and scaling
>>        The PMU can be shared by multiple programs each controlling a variable
>>        number of events. Multiplexing occurs by default unless pinned is
>>        requested. The exclusive option only guarantees the group does not
>>        share the PMU with other groups while it is active, at least this is
>>        my understanding.
> We have pinned and exclusive. pinned means always on the PMU, exclusive
> means when on the PMU no-one else can be.
exclusive: no sharing even if the group does not use all the counters
AND there are other events waiting for the resource. Right?

>>        By default, you may be multiplexed and if that happens you cannot know
>>        unless you request the timing information as part of the read_format.
>>        Without it, and if multiplexing has occurred, bogus counts may be
>>        returned with no indication whatsoever.
> I don't see the problem, you knew they could get multiplexed, yet you
> didn't ask for the information needed to extrapolate the information,
> sounds like you get what you asked for.
The API specification must then clearly say: events and groups of events
are multiplexed by default. Scaling is not done automatically.
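
For reference, this is the scaling a tool then has to do by hand. It is a
minimal sketch, assuming the timing information boils down to two values:
the time the event was enabled and the time it was actually running on the
PMU. The function and parameter names are mine.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Extrapolate a raw count to the full enabled time.
 * enabled: time the event was enabled.
 * running: time the event was actually on the PMU (<= enabled).
 */
static uint64_t scale_count(uint64_t raw, uint64_t enabled, uint64_t running)
{
    assert(running <= enabled);
    if (running == 0)
        return 0;   /* never scheduled: nothing to extrapolate from */
    if (running == enabled)
        return raw; /* not multiplexed: the count is exact */
    /* Estimate only; round to the nearest integer. */
    return (uint64_t)((double)raw * (double)enabled / (double)running + 0.5);
}
```

A tool that skips the timing information has no way to tell the second case
from the third, which is the silent error I am worried about.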

>>        To avoid returning misleading information, it seems like the API should
>>        refuse to open a non-pinned event which does not have
>>        read_format. This would avoid a lot of confusion down the road.
> I prefer to give people rope and tell them how to tie the knot.
This is a case of silent error. I suspect many people will fall into that trap.
Need to make sure documentation warns about that.

>>   7/ Multiplexing and system-wide
>>        Multiplexing is time-based and it is hooked into the timer tick. At
>>        every tick, the kernel tries to schedule another group of events.
>>        In tickless kernels if a CPU is idle, no timer tick is generated,
>>        therefore no multiplexing occurs. This is incorrect. It's not because
>>        the CPU is idle, that there aren't any interesting PMU events to measure.
>>        Parts of the CPU may still be active, e.g., caches and buses. And thus,
>>        it is expected that multiplexing still happens.
>>        You need to hook up the timer source for multiplexing to something else
>>        which is not affected by tickless.
> Or inhibit nohz when there are active counters, but good point.
I would rather not inhibit nohz, because you would be modifying the system
you're trying to monitor.

>>   8/ Controlling group multiplexing
>>        Although multiplexing is somehow exposed to the user via the timing
>>        information, I believe there is not enough control. I know of advanced
>>        monitoring tools which need to measure over a dozen events in one
>>        monitoring session. Given that the underlying PMU does not have enough
>>        counters OR that certain events cannot be measured together, it is
>>        necessary to split the events into groups and multiplex them. Events
>>        are not grouped at random AND groups are not ordered at random either.
>>        The sequence of groups is carefully chosen such that related events are
>>        in neighboring groups such that they measure similar parts of the
>>        execution. This way you can mitigate the fluctuations introduced by
>>        multiplexing and compare ratios. In other words, some tools may want to
>>        control the order in which groups are scheduled on the PMU.
> Current code RR groups in the order they are created, is more control
> needed?
I understand the creation order in the case of a single tool.

My point was more in the case of multiple groups from multiple tools competing.
Imagine, Tool A and B want to monitor thread T. A has 3 groups, B 2 groups.
Imagine all groups are exclusive. Does this mean that all groups of A will be
multiplexed and THEN all groups of B, or can they be interleaved, e.g. 1 group
from A, followed by 1 group from B?

This behavior has to be clearly spelled out by the API.

>>   9/ Event buffer
>>        There is a kernel level event buffer which can be re-mapped read-only at
>>        the user level via mmap(). The buffer must be a multiple of page size
> 2^n actually
Yes. But this is rounded-up to pages because of remapping. So better make use
of the full space.

>>        and must be at least 2-page long. The first page is used for the
>>        counter re-mapping and buffer header, the second for the actual event
>>        buffer.
> I think a single data page is valid too (2^0=1).
I have not tried that yet.
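
If I read the rule correctly, the mapping is one header page plus 2^n data
pages. A minimal sketch of the size computation under that reading (the
function name is hypothetical, and I have not checked it against what the
kernel accepts):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Length in bytes to pass to mmap(): one header page followed by
 * n_data data pages. Returns 0 if n_data is not a power of two.
 */
static size_t event_buffer_len(size_t page_size, size_t n_data)
{
    assert(page_size != 0);
    if (n_data == 0 || (n_data & (n_data - 1)) != 0)
        return 0; /* data area must be 2^n pages */
    return (1 + n_data) * page_size;
}
```

With 4k pages, a single data page gives 8192 bytes and 4 data pages give
20480 bytes, while 3 data pages would be rejected.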

> Suppose we have mapped 4 pages (of page size 4k), that means our buffer
> position would be the lower 14 bits of data_head.
> Now say the last observed position was:
>   0x00003458 (& 0x00003FFF == 0x3458)
> and the next observed position is:
>   0x00013458 (& 0x00003FFF == 0x3458)
> We'd still be able to tell we overflowed 8 times.
Isn't it 4 times?
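
Working it out: 4 pages of 4k is 0x4000 bytes, and 0x13458 - 0x3458 is
0x10000, so that is 4 wraps. A minimal sketch of the arithmetic, assuming
data_head is a free-running byte count and the data area is a power of two
in size (the helper names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* Position inside the buffer for a given data_head value. */
static uint64_t buf_pos(uint64_t head, uint64_t buf_bytes)
{
    assert(buf_bytes != 0 && (buf_bytes & (buf_bytes - 1)) == 0);
    return head & (buf_bytes - 1);
}

/* Number of full buffer wraps between two observed data_head values. */
static uint64_t wraps_between(uint64_t last, uint64_t now, uint64_t buf_bytes)
{
    assert(buf_bytes != 0 && (buf_bytes & (buf_bytes - 1)) == 0);
    /* Unsigned subtraction stays correct if data_head itself wraps once. */
    return (now - last) / buf_bytes;
}
```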

> Does this suffice?
Should work, assuming you have some bits left for the overflow.
That means you cannot actually go to 4GB of space unless you
know you cannot lose the race with the kernel.

>>   11/ reserve_percpu
>>        There are more than counters on many PMU models. Counters are not
>>        symmetrical even on X86.
>>        What does this API actually guarantee in terms of what events a tool
>>        will be able to measure with the reserved counters?
>> II/ X86 comments
>>   Mostly implementation related comments in this section.
>>   1/ Fixed counter and event on Intel
>>        You cannot simply fall back to generic counters if you cannot find
>>        a fixed counter. There are model-specific bugs, for instance
>>        UNHALTED_REFERENCE_CYCLES (0x013c) does not measure the same thing on
>>        Nehalem when it is used in fixed counter 2 or a generic counter. The
>>        same is true on Core.
>>        You cannot simply look at the event field code to determine whether
>>        this is an event supported by a fixed counter. You must look at the
>>        other fields such as edge, invert, cnt-mask. If those are present then
>>        you have to fall back to using a generic counter as fixed counters only
>>        support priv level filtering. As indicated above, though, programming
>>        UNHALTED_REFERENCE_CYCLES on a generic counter does not count the same
>>        thing, therefore you need to fail if filters other than priv levels are
>>        present on this event.
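
The decision I have in mind looks roughly like the sketch below. It is only
an illustration, assuming the raw code uses the PERFEVTSEL layout (edge is
bit 18, invert is bit 23, cnt-mask is bits 24-31). The enum, the function
name and the has_fixed parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EVTSEL_EVENT_UMASK 0xffffull       /* event select + unit mask */
#define EVTSEL_EDGE        (1ull << 18)
#define EVTSEL_INV         (1ull << 23)
#define EVTSEL_CMASK       (0xffull << 24)
#define UNHALTED_REFERENCE_CYCLES 0x013cull

enum placement { PLACE_FIXED, PLACE_GENERIC, PLACE_FAIL };

/*
 * raw:       32-bit raw event code.
 * has_fixed: a fixed counter exists for this event and is available.
 */
static enum placement place_event(uint64_t raw, bool has_fixed)
{
    bool extra, refcyc;

    assert((raw >> 32) == 0);
    /* Filters that fixed counters cannot apply. */
    extra = (raw & (EVTSEL_EDGE | EVTSEL_INV | EVTSEL_CMASK)) != 0;
    refcyc = (raw & EVTSEL_EVENT_UMASK) == UNHALTED_REFERENCE_CYCLES;

    if (!extra && has_fixed)
        return PLACE_FIXED;
    if (refcyc)
        return PLACE_FAIL; /* a generic counter would count something else */
    return PLACE_GENERIC;
}
```
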
>>   2/ Event knowledge missing
>>        There are constraints and bugs on some events in Intel Core and Nehalem.
>>        In your model, those need to be taken care of by the kernel. Should the
>>        kernel make the wrong decision, there would be no work-around for user
>>        tools. Take the example I outlined just above with Intel fixed counters.
>>        Constraints do exist on AMD64 processors as well.
> Good thing updating the kernel is so easy ;-)

Not once this is in production though.

>>   3/ Interrupt throttling
>>        There is apparently no way for a system admin to set the threshold. It
>>        is hardcoded.
>>        Throttling occurs without the tool(s) knowing. I think this is a problem.
> Fixed, it has a sysctl now, is in generic code and emits timestamped
> throttle/unthrottle events to the data stream, Power also implemented
> the needed bits.

>> III/ Requests
>>   1/ Sampling period change
>>        As it stands today, it seems there is no way to change a period but to
>>        close() the event file descriptor and start over. When you close the
>>        group leader, it is not clear to me what happens to the remaining events.
> The other events will be 'promoted' to individual counters and continue
> on until their fd is closed too.
So you'd better start from scratch because you will lose group sampling.

>>        I know of tools which want to adjust the sampling period based on the
>>        number of samples they get per second.
> I recently implemented dynamic period stuff, it adjusts the period every
> tick so as to strive for a given target frequency.
I am wondering if the tool shouldn't be in charge of that rather than
the kernel.
At least it would give it more control about what is happening and when.
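
What a tool would do on its own is simple enough. Here is a minimal sketch,
assuming the tool counts its samples over a fixed interval and aims for a
target rate. The function name and parameters are mine, and this is not a
description of what the kernel code does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Pick the next sampling period so the sample rate moves toward
 * target_hz, given 'samples' observed over 'interval_sec' seconds
 * with the current period.
 */
static uint64_t adjust_period(uint64_t period, uint64_t samples,
                              double interval_sec, double target_hz)
{
    double observed_hz, next;

    assert(period > 0 && interval_sec > 0.0 && target_hz > 0.0);
    if (samples == 0)
        return period; /* no data: keep the current period */
    observed_hz = (double)samples / interval_sec;
    /* Too many samples means the period must grow, and vice versa. */
    next = (double)period * observed_hz / target_hz;
    if (next >= 1.8e19)
        return UINT64_MAX; /* clamp instead of overflowing the cast */
    return next < 1.0 ? 1 : (uint64_t)(next + 0.5);
}
```

The tool then decides when to apply the new period, e.g., only between
measurement phases, which is the control I am referring to.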

>>        By design, your perf_counter_open() should not really be in the
>>        critical path, e.g., when you are processing samples from the event
>>        buffer. Thus, I think it would be good to have a dedicated call to
>>        allow changing the period.
> Yet another ioctl() ?
I would say yet another syscall.

>>   2/ Sampling period randomization
>>        It is our experience (on Itanium, for instance), that for certain
>>        sampling measurements, it is beneficial to randomize the sampling
>>        period a bit. This is in particular the case when sampling on an
>>        event that happens very frequently and which is not related to
>>        timing, e.g., branch_instructions_retired. Randomization helps mitigate
>>        the bias. You do not need anything sophisticated. But when you are using
>>        a kernel-level sampling buffer, you need to have the kernel randomize.
>>        Randomization needs to be supported per event.
> Corey raised this a while back, I asked what kind of parameters were
> needed and if a specific (p)RNG was specified.
> Is something with an (avg,std) good enough? Do you have an
> implementation that I can borrow, or even better a patch? :-)

I think all you need is a bitmask to control the range of variation of the
period. As I said, the randomizer does not need to be sophisticated.
In perfmon we originally used the Carta random number generator.
But nowadays, we use the existing random32() kernel function.
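
To show what I mean by a bitmask, here is a minimal sketch. The random value
is passed in by the caller and stands in for whatever the kernel generator
returns. The function name is made up and this is not the perfmon code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Randomize a sampling period: the base period plus a random offset
 * whose range is limited by 'mask' (e.g., 0xff allows 0..255).
 * A mask of 0 disables randomization. The caller must keep
 * base + mask below the 64-bit limit.
 */
static uint64_t randomize_period(uint64_t base, uint64_t mask, uint32_t rnd)
{
    assert(base > 0);
    return base + ((uint64_t)rnd & mask);
}
```

The mask is per event, so it can travel in the event structure next to the
period.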

>> IV/ Open questions
>>   1/ Support for model-specific uncore PMU monitoring capabilities
>>        Recent processors have multiple PMUs. Typically one per core but
>>        also one at the socket level, e.g., Intel Nehalem. It is expected that
>>        this API will provide access to these PMUs as well.
>>        It seems like with the current API, raw events for those PMUs would need
>>        a new architecture-specific type as the event encoding by itself may
>>        not be enough to disambiguate between a core and uncore PMU event.
>>        How are those events going to be supported?
> /me goes poke at the docs... and finds MSR_OFFCORE_RSP_0. Not sure I
> quite get what they're saying though, but yes
This one is not uncore, it's core. Name is confusing. The uncore is all the
UNC_ stuff. See Vol3b section 18.17.2.

>>   2/ Features impacting all counters
>>        On some PMU models, e.g., Itanium, there are certain features which have
>>        an influence on all counters that are active. For instance, there is a
>>        way to restrict monitoring to a range of continuous code or data
>>        addresses using both some PMU registers and the debug registers.
>>        Given that the API exposes events (counters) as independent of each
>>        other, I wonder how range restriction could be implemented.
> Setting them per counter and when scheduling the counters check for
> compatibility and stop adding counters to the pmu if the next counter is
> incompatible.
How would you pass the code range addresses per-counter?
Suppose I want to monitor CYCLES between 0x100000-0x200000.
Range is specified using debug registers.

>>        Similarly, on Itanium, there are global behaviors. For instance, on
>>        counter overflow the entire PMU freezes all at once. That seems to be
>>        contradictory with the design of the API which creates the illusion of
>>        independence.
> Simply take the interrupt, deal with the overflow, and continue, its not
> like the hardware can do any better, can it?
Hardware cannot do more. That means other unrelated counters which happen
to have been scheduled at the same time will be blindspotted.

I suspect that for Itanium, the better way is to refuse to co-schedule events
from different groups, i.e., always run in exclusive mode.

>>   3/ AMD IBS
>>        How is AMD IBS going to be implemented?
>>        IBS has two separate sets of registers. One to capture fetch related
>>        data and another one to capture instruction execution data. For each,
>>        there is one config register but multiple data registers. In each mode,
>>        there is a specific sampling period and IBS can interrupt.
>>        It looks like you could define two pseudo events or event types and then
>>        define a new record_format and read_format. Those formats would only be
>>        valid for an IBS event.
>>        Is that how you intend to support IBS?
> I can't seem to locate the IBS stuff in the documentation currently, and
> I must admit I've not yet looked into it, others might have.
AMD BIOS and Kernel Developer's Guide (BKDG) for Family 10h, section 3.13.
You have the register descriptions.