Re: [RFCv3] perf: proposed perf_event_open() manpage

From: Michael Kerrisk (man-pages)
Date: Thu Dec 13 2012 - 04:47:44 EST


[CC += tglx]

On Mon, Nov 26, 2012 at 7:54 AM, Vince Weaver <vincent.weaver@xxxxxxxxx> wrote:
> Hello
>
> Here is an updated version of the proposed manpage.
>
> Ingo, Peter Z., I know you're busy but would it be possible for you
> to give this a sanity check? This is likely to become the document
> that most users of the perf_event ABI will be using so it's important
> to catch any problems with it now.
>
> The most recent changes have been updates to the documentation on signal
> generation on overflow and the PERF_EVENT_IOC_REFRESH ioctl(). It would
> be nice if people in the know could review these; the perf_event overflow
> handling is a mess that varies from kernel to kernel and the intended
> "official" behavior is hard to pin down.
> For example, on recent kernels it seems that attr.wakeup_events is
> ignored and an overflow signal is *always* sent on event overflow.

Thomas, and Ingo(?)

As the folk who first had a hand in implementing this system call
would you be willing to take a look at this page that Vincent has
devoted a great deal of effort to? Please?

Thanks,

Michael


> .\" Hey Emacs! This file is -*- nroff -*- source.
> .\"
> .\" Copyright (c) 2012, Vincent Weaver
> .\"
> .\" This is free documentation; you can redistribute it and/or
> .\" modify it under the terms of the GNU General Public License as
> .\" published by the Free Software Foundation; either version 2 of
> .\" the License, or (at your option) any later version.
> .\"
> .\" The GNU General Public License's references to "object code"
> .\" and "executables" are to be interpreted as the output of any
> .\" document formatting or typesetting system, including
> .\" intermediate and printed output.
> .\"
> .\" This manual is distributed in the hope that it will be useful,
> .\" but WITHOUT ANY WARRANTY; without even the implied warranty of
> .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> .\" GNU General Public License for more details.
> .\"
> .\" You should have received a copy of the GNU General Public
> .\" License along with this manual; if not, see
> .\" <http://www.gnu.org/licenses/>.
> .\"
> .\" This document is based on the perf_event.h header file, the
> .\" tools/perf/design.txt file, and a lot of bitter experience.
> .\"
> .TH PERF_EVENT_OPEN 2 2012-11-27 "Linux" "Linux Programmer's Manual"
> .SH NAME
> perf_event_open \- set up performance monitoring
> .SH SYNOPSIS
> .nf
> .B #include <linux/perf_event.h>
> .B #include <linux/hw_breakpoint.h>
> .sp
> .BI "int perf_event_open(struct perf_event_attr *" attr ,
> .BI " pid_t " pid ", int " cpu ", int " group_fd ,
> .BI " unsigned long " flags );
> .fi
>
> .IR Note :
> There is no glibc wrapper for this system call; see NOTES.
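
For illustration: because there is no glibc wrapper, callers normally
reach the system call through syscall(2). A minimal sketch, assuming
Linux headers that define __NR_perf_event_open; the helper name
perf_event_open_sketch is hypothetical and not part of any library.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: thin wrapper around the raw system call.
 * Returns the new file descriptor, or -1 with errno set. */
long perf_event_open_sketch(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}
```

Passing pid == -1 together with cpu == -1 is the invalid combination
described further down, so such a call is expected to fail with -1.
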
> .SH DESCRIPTION
> Given a list of parameters,
> .BR perf_event_open ()
> returns a file descriptor, for use in subsequent system calls
> .RB ( read "(2), " mmap "(2), " prctl "(2), " fcntl "(2), etc.)."
> .PP
> A call to
> .BR perf_event_open ()
> creates a file descriptor that allows measuring performance
> information.
> Each file descriptor corresponds to one
> event that is measured; these can be grouped together
> to measure multiple events simultaneously.
> .PP
> Events can be enabled and disabled in two ways: via
> .BR ioctl (2)
> and via
> .BR prctl (2) .
> When an event is disabled it does not count or generate overflows but does
> continue to exist and maintain its count value.
> .PP
> Events come in two flavors: counting and sampling.
> A
> .I counting
> event is one that is used for counting the aggregate number of events
> that occur.
> In general, counting event results are gathered with a
> .BR read (2)
> call.
> A
> .I sampling
> event periodically writes measurements to a buffer that can then
> be accessed via
> .BR mmap (2) .
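
To make the counting flavor concrete, here is a rough sketch that
counts user-space instructions retired by a small loop on the calling
thread. It assumes a PMU is available and that perf_event_paranoid
permits the open; the function name count_loop_instructions and the
workload are made up for the example.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: count user-space instructions retired by a busy
 * loop of `iters` iterations on the calling thread.
 * Returns 0 and stores the count in *out, or -1 if the event cannot be
 * opened or read (no PMU, restrictive perf_event_paranoid, ...). */
int count_loop_instructions(unsigned long iters, uint64_t *out)
{
    struct perf_event_attr attr;
    volatile unsigned long sink = 0;   /* keeps the loop from being elided */
    uint64_t value;
    int fd, ok;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;                 /* start stopped; enable below */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    /* pid == 0, cpu == -1: this thread, on whatever CPU it runs */
    fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    for (unsigned long i = 0; i < iters; i++)
        sink += i;
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    /* With read_format == 0, read(2) yields a single u64 count */
    ok = read(fd, &value, sizeof(value)) == (ssize_t)sizeof(value);
    close(fd);
    if (!ok)
        return -1;
    *out = value;
    return 0;
}
```

The event is created disabled and switched on only around the loop, so
setup code is not counted.
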
> .SS Arguments
> .P
> The argument
> .I pid
> allows events to be attached to processes in various ways.
> If
> .I pid
> is 0, measurements happen on the current thread; if
> .I pid
> is greater than 0, the process indicated by
> .I pid
> is measured, and if
> .I pid
> is \-1, all processes are counted.
>
> The
> .I cpu
> argument allows measurements to be specific to a CPU.
> If
> .I cpu
> is greater than or equal to 0,
> measurements are restricted to the specified CPU;
> if
> .I cpu
> is \-1, the events are measured on all CPUs.
> .P
> Note that the combination of
> .IR pid " == \-1"
> and
> .IR cpu " == \-1"
> is not valid.
> .P
> A
> .IR pid " > 0"
> and
> .IR cpu " == \-1"
> setting measures per-process and follows that process to whatever CPU the
> process gets scheduled to.
> Per-process events can be created by any user.
> .P
> A
> .IR pid " == \-1"
> and
> .IR cpu " >= 0"
> setting is per-CPU and measures all processes on the specified CPU.
> Per-CPU events need the
> .B CAP_SYS_ADMIN
> capability or a
> .I /proc/sys/kernel/perf_event_paranoid
> value of less than 1.
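
The pid/cpu combinations above can be summarized in a few lines of C.
This is only a sketch of the rules as stated in the text; the enum and
function names are hypothetical, and treating values below -1 as
invalid is my assumption rather than something the page says.

```c
#include <sys/types.h>

enum target_kind {            /* hypothetical names, for illustration */
    TARGET_INVALID,           /* pid == -1 && cpu == -1 (or below -1) */
    TARGET_TASK_ANY_CPU,      /* pid >= 0, cpu == -1: follows the task */
    TARGET_TASK_ONE_CPU,      /* pid >= 0, cpu >= 0: task, on that CPU only */
    TARGET_CPU_WIDE           /* pid == -1, cpu >= 0: all tasks on one CPU */
};

enum target_kind classify_target(pid_t pid, int cpu)
{
    if (pid < -1 || cpu < -1)
        return TARGET_INVALID;    /* assumption: not covered by the text */
    if (pid == -1)
        return cpu == -1 ? TARGET_INVALID : TARGET_CPU_WIDE;
    return cpu == -1 ? TARGET_TASK_ANY_CPU : TARGET_TASK_ONE_CPU;
}
```

Only TARGET_CPU_WIDE is the case that needs CAP_SYS_ADMIN or a relaxed
perf_event_paranoid setting.
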
> .P
> The
> .I group_fd
> argument allows event groups to be created.
> An event group has one event which is the group leader.
> The leader is created first, with
> .IR group_fd " = \-1."
> The rest of the group members are created with subsequent
> .BR perf_event_open ()
> calls with
> .IR group_fd
> being set to the fd of the group leader.
> (A single event on its own is created with
> .IR group_fd " = \-1"
> and is considered to be a group with only 1 member.)
> An event group is scheduled onto the CPU as a unit: it will only
> be put onto the CPU if all of the events in the group can be put onto
> the CPU.
> This means that the values of the member events can be
> meaningfully compared, added, divided (to get ratios), etc., with each
> other, since they have counted events for the same set of executed
> instructions.
> .P
> The
> .I flags
> argument takes one of the following values:
> .TP
> .BR PERF_FLAG_FD_NO_GROUP
> .\" FIXME The following sentence is unclear
> This flag allows creating an event as part of an event group but
> having no group leader.
> It is unclear why this is useful.
> .\" FIXME So, why is it useful?
> .TP
> .BR PERF_FLAG_FD_OUTPUT
> This flag re-routes the output from an event to the group leader.
> .TP
> .BR PERF_FLAG_PID_CGROUP " (Since Linux 2.6.39)."
> This flag activates per-container system-wide monitoring.
> A container
> is an abstraction that isolates a set of resources for finer-grained
> control (CPUs, memory, etc.).
> In this mode, the event is measured
> only if the thread running on the monitored CPU belongs to the designated
> container (cgroup).
> The cgroup is identified by passing a file descriptor
> opened on its directory in the cgroupfs filesystem.
> For instance, if the
> cgroup to monitor is called
> .IR test ,
> then a file descriptor opened on
> .I /dev/cgroup/test
> (assuming cgroupfs is mounted on
> .IR /dev/cgroup )
> must be passed as the
> .I pid
> parameter.
> cgroup monitoring is only available
> for system-wide events and may therefore require extra permissions.
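
A sketch of the cgroup case, in which the pid argument carries a file
descriptor for the cgroup directory. The helper name open_cgroup_event
is hypothetical, and closing the cgroup descriptor right after the
call assumes the kernel keeps its own reference to the cgroup.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: open `attr` for tasks in the cgroup whose
 * directory is `cgroup_dir`, restricted to CPU `cpu`.
 * Returns the event fd, or -1 on failure. */
int open_cgroup_event(struct perf_event_attr *attr, const char *cgroup_dir,
                      int cpu)
{
    int cgroup_fd, event_fd;

    cgroup_fd = open(cgroup_dir, O_RDONLY);
    if (cgroup_fd < 0)
        return -1;

    /* With PERF_FLAG_PID_CGROUP, the pid argument is the cgroup fd. */
    event_fd = (int)syscall(__NR_perf_event_open, attr, cgroup_fd, cpu, -1,
                            (unsigned long)PERF_FLAG_PID_CGROUP);

    /* Assumption: the event holds its own reference to the cgroup. */
    close(cgroup_fd);
    return event_fd < 0 ? -1 : event_fd;
}
```

With the example path from the text, the call would look like
open_cgroup_event(&attr, "/dev/cgroup/test", 0).
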
> .P
> The
> .I perf_event_attr
> structure provides detailed configuration information
> for the event being created.
>
> .in +4n
> .nf
> struct perf_event_attr {
> __u32 type; /* Type of event */
> __u32 size; /* Size of attribute structure */
> __u64 config; /* Type-specific configuration */
>
> union {
> __u64 sample_period; /* Period of sampling */
> __u64 sample_freq; /* Frequency of sampling */
> };
>
> __u64 sample_type; /* Specifies values included in sample */
> __u64 read_format; /* Specifies values returned in read */
>
> __u64 disabled : 1, /* off by default */
> inherit : 1, /* children inherit it */
> pinned : 1, /* must always be on PMU */
> exclusive : 1, /* only group on PMU */
> exclude_user : 1, /* don't count user */
> exclude_kernel : 1, /* don't count kernel */
> exclude_hv : 1, /* don't count hypervisor */
> exclude_idle : 1, /* don't count when idle */
> mmap : 1, /* include mmap data */
> comm : 1, /* include comm data */
> freq : 1, /* use freq, not period */
> inherit_stat : 1, /* per task counts */
> enable_on_exec : 1, /* next exec enables */
> task : 1, /* trace fork/exit */
> watermark : 1, /* wakeup_watermark */
> precise_ip : 2, /* skid constraint */
> mmap_data : 1, /* non-exec mmap data */
> sample_id_all : 1, /* sample_type all events */
> exclude_host : 1, /* don't count in host */
> exclude_guest : 1, /* don't count in guest */
> exclude_callchain_kernel : 1, /* exclude kernel callchains */
> exclude_callchain_user : 1, /* exclude user callchains */
> __reserved_1 : 41;
>
> union {
> __u32 wakeup_events; /* wakeup every n events */
> __u32 wakeup_watermark; /* bytes before wakeup */
> };
>
> __u32 bp_type; /* breakpoint type */
>
> union {
> __u64 bp_addr; /* breakpoint address */
> __u64 config1; /* extension of config */
> };
>
> union {
> __u64 bp_len; /* breakpoint length */
> __u64 config2; /* extension of config1 */
> };
> __u64 branch_sample_type; /* enum perf_branch_sample_type */
> __u64 sample_regs_user; /* user regs to dump on samples */
> __u32 sample_stack_user; /* size of stack to dump on samples */
> __u32 __reserved_2; /* Align to u64. */
>
> };
> .fi
> .in
>
> The fields of the
> .I perf_event_attr
> structure are described in more detail below:
>
> .TP
> .I type
> This field specifies the overall event type.
> It has one of the following values:
> .RS
> .TP
> .B PERF_TYPE_HARDWARE
> This indicates one of the "generalized" hardware events provided
> by the kernel.
> See the
> .I config
> field definition for more details.
> .TP
> .B PERF_TYPE_SOFTWARE
> This indicates one of the software-defined events provided by the kernel
> (even if no hardware support is available).
> .TP
> .B PERF_TYPE_TRACEPOINT
> This indicates a tracepoint
> provided by the kernel tracepoint infrastructure.
> .TP
> .B PERF_TYPE_HW_CACHE
> This indicates a hardware cache event.
> This has a special encoding, described in the
> .I config
> field definition.
> .TP
> .B PERF_TYPE_RAW
> This indicates a "raw" implementation-specific event in the
> .IR config " field."
> .TP
> .BR PERF_TYPE_BREAKPOINT " (Since Linux 2.6.33)"
> This indicates a hardware breakpoint as provided by the CPU.
> Breakpoints can be read/write accesses to an address as well as
> execution of an instruction address.
> .TP
> .RB "dynamic PMU"
> Since Linux 2.6.39,
> .BR perf_event_open ()
> can support multiple PMUs.
> To enable this, a value exported by the kernel can be used in the
> .I type
> field to indicate which PMU to use.
> The value to use can be found in the sysfs filesystem:
> there is a subdirectory per PMU instance under
> .IR /sys/bus/event_source/devices .
> In each sub-directory there is a
> .I type
> file whose content is an integer that can be used in the
> .I type
> field.
> For instance,
> .I /sys/bus/event_source/devices/cpu/type
> contains the value for the core CPU PMU, which is usually 4.
> .RE
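
Reading the dynamic PMU type from sysfs might look like the following.
This is a sketch only; read_pmu_type is a hypothetical name, and the
path layout is the one described in the text above.

```c
#include <stdio.h>

/* Hypothetical helper: read the integer stored in
 * /sys/bus/event_source/devices/<pmu>/type.
 * Returns the type value, or -1 if the file is missing or malformed. */
int read_pmu_type(const char *pmu)
{
    char path[256];
    FILE *f;
    int type = -1;

    if (snprintf(path, sizeof(path),
                 "/sys/bus/event_source/devices/%s/type", pmu)
            >= (int)sizeof(path))
        return -1;                 /* name too long for the buffer */
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    return type;
}
```

A caller would place the result of read_pmu_type("cpu") in attr.type
after checking that it is not -1.
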
>
> .TP
> .I "size"
> The size of the
> .I perf_event_attr
> structure for forward/backward compatibility.
> Set this using
> .I sizeof(struct perf_event_attr)
> to allow the kernel to see
> the struct size at the time of compilation.
>
> The related define
> .B PERF_ATTR_SIZE_VER0
> is set to 64; this was the size of the first published struct.
> .B PERF_ATTR_SIZE_VER1
> is 72, corresponding to the addition of breakpoints in Linux 2.6.33.
> .B PERF_ATTR_SIZE_VER2
> is 80 corresponding to the addition of branch sampling in Linux 3.4.
> .B PERF_ATTR_SIZE_VER3
> is 96 corresponding to the addition
> of sample_regs_user and sample_stack_user in Linux 3.7.
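
A small sketch of the usual initialization pattern for the size field.
The helper name init_event_attr is hypothetical.

```c
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper: zero the structure, then fill in the fields
 * that every event needs. Zeroing first keeps reserved bits at 0. */
void init_event_attr(struct perf_event_attr *attr, __u32 type, __u64 config)
{
    memset(attr, 0, sizeof(*attr));
    attr->size = sizeof(*attr);   /* struct size as seen at compile time */
    attr->type = type;
    attr->config = config;
}
```

Because size comes from sizeof, a binary built against older headers
reports the older, smaller structure size to the kernel.
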
>
> .TP
> .I "config"
> This specifies which event you want, in conjunction with
> the
> .I type
> field.
> The
> .IR config1 " and " config2
> fields are also taken into account in cases where 64 bits is not
> enough to fully specify the event.
> The encoding of these fields is event dependent.
>
> The most significant bit (bit 63) of
> .I config
> signifies CPU-specific (raw) counter configuration data;
> if the most significant bit is unset, the next 7 bits are an event
> type and the rest of the bits are the event identifier.
>
> There are various ways to set the
> .I config
> field that are dependent on the value of the previously
> described
> .I type
> field.
> What follows are various possible settings for
> .I config
> separated out by
> .IR type .
>
> If
> .I type
> is
> .BR PERF_TYPE_HARDWARE ,
> we are measuring one of the generalized hardware CPU events.
> Not all of these are available on all platforms.
> Set
> .I config
> to one of the following:
> .RS 12
> .TP
> .B PERF_COUNT_HW_CPU_CYCLES
> Total cycles.
> Be wary of what happens during CPU frequency scaling.
> .TP
> .B PERF_COUNT_HW_INSTRUCTIONS
> Retired instructions.
> Be careful, these can be affected by various
> issues, most notably hardware interrupt counts.
> .TP
> .B PERF_COUNT_HW_CACHE_REFERENCES
> Cache accesses.
> Usually this indicates Last Level Cache accesses but this may
> vary depending on your CPU.
> This may include prefetches and coherency messages; again this
> depends on the design of your CPU.
> .TP
> .B PERF_COUNT_HW_CACHE_MISSES
> Cache misses.
> Usually this indicates Last Level Cache misses; this is intended to be
> used in conjunction with the
> .B PERF_COUNT_HW_CACHE_REFERENCES
> event to calculate cache miss rates.
> .TP
> .B PERF_COUNT_HW_BRANCH_INSTRUCTIONS
> Retired branch instructions.
> Prior to Linux 2.6.34, this used
> the wrong event on AMD processors.
> .TP
> .B PERF_COUNT_HW_BRANCH_MISSES
> Mispredicted branch instructions.
> .TP
> .B PERF_COUNT_HW_BUS_CYCLES
> Bus cycles, which can be different from total cycles.
> .TP
> .BR PERF_COUNT_HW_STALLED_CYCLES_FRONTEND " (Since Linux 3.0)"
> Stalled cycles during issue.
> .TP
> .BR PERF_COUNT_HW_STALLED_CYCLES_BACKEND " (Since Linux 3.0)"
> Stalled cycles during retirement.
> .TP
> .BR PERF_COUNT_HW_REF_CPU_CYCLES " (Since Linux 3.3)"
> Total cycles; not affected by CPU frequency scaling.
> .RE
> .IP
> If
> .I type
> is
> .BR PERF_TYPE_SOFTWARE ,
> we are measuring software events provided by the kernel.
> Set
> .I config
> to one of the following:
> .RS 12
> .TP
> .B PERF_COUNT_SW_CPU_CLOCK
> This reports the CPU clock, a high-resolution per-CPU timer.
> .TP
> .B PERF_COUNT_SW_TASK_CLOCK
> This reports a clock count specific to the task that is running.
> .TP
> .B PERF_COUNT_SW_PAGE_FAULTS
> This reports the number of page faults.
> .TP
> .B PERF_COUNT_SW_CONTEXT_SWITCHES
> This counts context switches.
> Until Linux 2.6.34, these were all reported as user-space
> events; after that they are reported as happening in the kernel.
> .TP
> .B PERF_COUNT_SW_CPU_MIGRATIONS
> This reports the number of times the process
> has migrated to a new CPU.
> .TP
> .B PERF_COUNT_SW_PAGE_FAULTS_MIN
> This counts the number of minor page faults.
> These did not require disk I/O to handle.
> .TP
> .B PERF_COUNT_SW_PAGE_FAULTS_MAJ
> This counts the number of major page faults.
> These required disk I/O to handle.
> .TP
> .BR PERF_COUNT_SW_ALIGNMENT_FAULTS " (Since Linux 2.6.33)"
> This counts the number of alignment faults.
> These happen when unaligned memory accesses happen; the kernel
> can handle these but it reduces performance.
> This only happens on some architectures (never on x86).
> .TP
> .BR PERF_COUNT_SW_EMULATION_FAULTS " (Since Linux 2.6.33)"
> This counts the number of emulation faults.
> The kernel sometimes traps on unimplemented instructions
> and emulates them for userspace.
> This can negatively impact performance.
> .RE
> .RE
>
>
> .RS
> If
> .I type
> is
> .BR PERF_TYPE_TRACEPOINT ,
> then we are measuring kernel tracepoints.
> The value to use in
> .I config
> can be obtained from under debugfs
> .I tracing/events/*/*/id
> if ftrace is enabled in the kernel.
>
> .RE
>
> .RS
> If
> .I type
> is
> .BR PERF_TYPE_HW_CACHE ,
> then we are measuring a hardware CPU cache event.
> To calculate the appropriate
> .I config
> value use the following equation:
> .RS 4
> .nf
>
> (perf_hw_cache_id) | (perf_hw_cache_op_id << 8) |
> (perf_hw_cache_op_result_id << 16)
> .fi
> .P
> where
> .I perf_hw_cache_id
> is one of:
> .RS
> .TP
> .B PERF_COUNT_HW_CACHE_L1D
> for measuring Level 1 Data Cache
> .TP
> .B PERF_COUNT_HW_CACHE_L1I
> for measuring Level 1 Instruction Cache
> .TP
> .B PERF_COUNT_HW_CACHE_LL
> for measuring Last-Level Cache
> .TP
> .B PERF_COUNT_HW_CACHE_DTLB
> for measuring the Data TLB
> .TP
> .B PERF_COUNT_HW_CACHE_ITLB
> for measuring the Instruction TLB
> .TP
> .B PERF_COUNT_HW_CACHE_BPU
> for measuring the branch prediction unit
> .TP
> .BR PERF_COUNT_HW_CACHE_NODE " (Since Linux 3.0)"
> for measuring local memory accesses
> .RE
>
> .P
> and
> .I perf_hw_cache_op_id
> is one of
> .RS
> .TP
> .B PERF_COUNT_HW_CACHE_OP_READ
> for read accesses
> .TP
> .B PERF_COUNT_HW_CACHE_OP_WRITE
> for write accesses
> .TP
> .B PERF_COUNT_HW_CACHE_OP_PREFETCH
> for prefetch accesses
> .RE
>
> .P
> and
> .I perf_hw_cache_op_result_id
> is one of
> .RS
> .TP
> .B PERF_COUNT_HW_CACHE_RESULT_ACCESS
> to measure accesses
> .TP
> .B PERF_COUNT_HW_CACHE_RESULT_MISS
> to measure misses
> .RE
> .RE
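
The cache-event equation above translates directly into C. A sketch,
with hw_cache_config as a hypothetical helper name; the enum types are
the ones declared in <linux/perf_event.h>.

```c
#include <linux/perf_event.h>
#include <stdint.h>

/* Pack cache id, operation, and result into a PERF_TYPE_HW_CACHE
 * config value: id in bits 0-7, op in bits 8-15, result in bits 16-23. */
uint64_t hw_cache_config(enum perf_hw_cache_id cache,
                         enum perf_hw_cache_op_id op,
                         enum perf_hw_cache_op_result_id result)
{
    return (uint64_t)cache | ((uint64_t)op << 8) | ((uint64_t)result << 16);
}
```

For example, L1 data cache read misses would be requested with
hw_cache_config(PERF_COUNT_HW_CACHE_L1D, PERF_COUNT_HW_CACHE_OP_READ,
PERF_COUNT_HW_CACHE_RESULT_MISS).
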
>
> If
> .I type
> is
> .BR PERF_TYPE_RAW ,
> then a custom "raw"
> .I config
> value is needed.
> Most CPUs support events that are not covered by the "generalized" events.
> These are implementation defined; see your CPU manual (for example
> the Intel Volume 3B documentation or the AMD BIOS and Kernel Developer
> Guide).
> The libpfm4 library can be used to translate from the name in the
> architectural manuals to the raw hex value
> .BR perf_event_open ()
> expects in this field.
>
> If
> .I type
> is
> .BR PERF_TYPE_BREAKPOINT ,
> then leave
> .I config
> set to zero.
> Its parameters are set in other places.
> .RE
> .TP
> .IR sample_period ", " sample_freq
> A "sampling" counter is one that generates an interrupt
> every N events, where N is given by
> .IR sample_period .
> A sampling counter has
> .IR sample_period " > 0."
> When an overflow interrupt occurs, requested data is recorded
> in the mmap buffer.
> The
> .I sample_type
> field controls what data is recorded on each interrupt.
>
> .I sample_freq
> can be used if you wish to use frequency rather than period.
> In this case you set the
> .I freq
> flag.
> The kernel will adjust the sampling period
> to try and achieve the desired rate.
> The rate of adjustment is a
> timer tick.
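
As a sketch of the two ways to fill in this union (the helper names
are hypothetical):

```c
#include <linux/perf_event.h>

/* Take one sample every `period` events. */
void set_sample_period(struct perf_event_attr *attr, __u64 period)
{
    attr->freq = 0;              /* the union holds sample_period */
    attr->sample_period = period;
}

/* Ask the kernel to aim for roughly `hz` samples per second. */
void set_sample_freq(struct perf_event_attr *attr, __u64 hz)
{
    attr->freq = 1;              /* the union holds sample_freq */
    attr->sample_freq = hz;
}
```

Since sample_period and sample_freq share storage, the freq bit is the
only thing that tells the kernel which interpretation is meant.
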
>
>
> .TP
> .I "sample_type"
> The various bits in this field specify which values to include
> in the sample.
> They will be recorded in a ring-buffer,
> which is available to user-space using
> .BR mmap (2).
> The order in which the values are saved in the
> sample is documented in the MMAP Layout subsection below;
> it is not the
> .I "enum perf_event_sample_format"
> order.
> .RS
> .TP
> .B PERF_SAMPLE_IP
> Records instruction pointer.
> .TP
> .B PERF_SAMPLE_TID
> Records the process and thread ids.
> .TP
> .B PERF_SAMPLE_TIME
> Records a timestamp.
> .TP
> .B PERF_SAMPLE_ADDR
> Records an address, if applicable.
> .TP
> .B PERF_SAMPLE_READ
> Record counter values for all events in a group, not just the group leader.
> .TP
> .B PERF_SAMPLE_CALLCHAIN
> Records the callchain (stack backtrace).
> .TP
> .B PERF_SAMPLE_ID
> Records a unique ID for the opened event's group leader.
> .TP
> .B PERF_SAMPLE_CPU
> Records CPU number.
> .TP
> .B PERF_SAMPLE_PERIOD
> Records the current sampling period.
> .TP
> .B PERF_SAMPLE_STREAM_ID
> Records a unique ID for the opened event.
> Unlike
> .B PERF_SAMPLE_ID
> the actual ID is returned, not the group leader.
> This ID is the same as the one returned by PERF_FORMAT_ID.
> .TP
> .B PERF_SAMPLE_RAW
> Records additional data, if applicable.
> Usually returned by tracepoint events.
> .TP
> .BR PERF_SAMPLE_BRANCH_STACK " (Since Linux 3.4)"
> Records the branch stack. See branch_sample_type.
> .TP
> .BR PERF_SAMPLE_REGS_USER " (Since Linux 3.7)"
> Records the current register state.
> .TP
> .BR PERF_SAMPLE_STACK_USER " (Since Linux 3.7)"
> [To be documented]
> .RE
>
> .TP
> .IR "read_format"
> This field specifies the format of the data returned by
> .BR read (2)
> on a
> .BR perf_event_open ()
> file descriptor.
> .RS
> .TP
> .B PERF_FORMAT_TOTAL_TIME_ENABLED
> Adds the 64-bit "time_enabled" field.
> This can be used to calculate estimated totals if
> the PMU is overcommitted and multiplexing is happening.
> .TP
> .B PERF_FORMAT_TOTAL_TIME_RUNNING
> Adds the 64-bit "time_running" field.
> This can be used to calculate estimated totals if
> the PMU is overcommitted and multiplexing is happening.
> .TP
> .B PERF_FORMAT_ID
> Adds a 64-bit unique value that corresponds to the event group.
> .TP
> .B PERF_FORMAT_GROUP
> Allows all counter values in an event group to be read with one read.
> .RE
>
> .TP
> .IR "disabled"
> The
> .I disabled
> bit specifies whether the counter starts out disabled or enabled.
> If disabled, the event can later be enabled by
> .BR ioctl (2),
> .BR prctl (2),
> or
> .IR enable_on_exec .
>
> .TP
> .IR "inherit"
> The
> .I inherit
> bit specifies that this counter should count events of child
> tasks as well as the task specified.
> This only applies to new children, not to any existing children at
> the time the counter is created (nor to any new children of
> existing children).
>
> Inherit does not work for some combinations of
> .IR read_format s,
> such as
> .BR PERF_FORMAT_GROUP .
>
> .TP
> .IR "pinned"
> The
> .I pinned
> bit specifies that the counter should always be on the CPU if at all
> possible.
> It only applies to hardware counters and only to group leaders.
> If a pinned counter cannot be put onto the CPU (e.g., because there are
> not enough hardware counters or because of a conflict with some other
> event), then the counter goes into an 'error' state, where reads
> return end-of-file (i.e.,
> .BR read (2)
> returns 0) until the counter is subsequently enabled or disabled.
>
> .TP
> .IR "exclusive"
> The
> .I exclusive
> bit specifies that when this counter's group is on the CPU,
> it should be the only group using the CPU's counters.
> In the future this may allow monitoring programs to
> support PMU features that need to run alone so that they do not
> disrupt other hardware counters.
>
> .TP
> .IR "exclude_user"
> If this bit is set, the count excludes events that happen in user-space.
>
> .TP
> .IR "exclude_kernel"
> If this bit is set, the count excludes events that happen in kernel-space.
>
> .TP
> .IR "exclude_hv"
> If this bit is set, the count excludes events that happen in the
> hypervisor.
> This is mainly for PMUs that have built-in support for handling this
> (such as POWER).
> Extra support is needed for handling hypervisor measurements on most
> machines.
>
> .TP
> .IR "exclude_idle"
> If set, don't count when the CPU is idle.
>
> .TP
> .IR "mmap"
> The
> .I mmap
> bit enables recording of exec mmap events.
>
> .TP
> .IR "comm"
> The
> .I comm
> bit enables tracking of process command name as modified by the
> .BR exec (2)
> and
> .BR prctl (PR_SET_NAME)
> system calls.
> Unfortunately for tools,
> there is no way to distinguish one system call versus the other.
>
> .TP
> .IR "freq"
> If this bit is set, then
> .I sample_freq
> not
> .I sample_period
> is used when setting up the sampling interval.
>
> .TP
> .IR "inherit_stat"
> This bit enables saving of event counts on context switch for
> inherited tasks.
> This is only meaningful if the
> .I inherit
> field is set.
>
> .TP
> .IR "enable_on_exec"
> If this bit is set, a counter is automatically
> enabled after a call to
> .BR exec (2).
>
> .TP
> .IR "task"
> If this bit is set, then
> fork/exit notifications are included in the ring buffer.
>
> .TP
> .IR "watermark"
> If set, have a sampling interrupt happen when we cross the
> .I wakeup_watermark
> boundary.
> Otherwise interrupts happen after
> .I wakeup_events
> samples.
>
> .TP
> .IR "precise_ip" " (Since Linux 2.6.35)"
> This controls the amount of skid.
> Skid is how many instructions
> execute between an event of interest happening and the kernel
> being able to stop and record the event.
> Smaller skid is
> better and allows more accurate reporting of which events
> correspond to which instructions, but hardware is often limited
> with how small this can be.
>
> The values of this are the following:
> .RS
> .TP
> 0 -
> .B SAMPLE_IP
> can have arbitrary skid
> .TP
> 1 -
> .B SAMPLE_IP
> must have constant skid
> .TP
> 2 -
> .B SAMPLE_IP
> requested to have 0 skid
> .TP
> 3 -
> .B SAMPLE_IP
> must have 0 skid.
> See also
> .BR PERF_RECORD_MISC_EXACT_IP .
> .RE
>
> .TP
> .IR "mmap_data" " (Since Linux 2.6.36)"
> The counterpart of the
> .I mmap
> field, but enables including data mmap events
> in the ring-buffer.
>
> .TP
> .IR "sample_id_all" " (Since Linux 2.6.38)"
> If set, then TID, TIME, ID, CPU, and STREAM_ID can
> additionally be included in
> .RB non- PERF_RECORD_SAMPLE s
> if the corresponding
> .I sample_type
> is selected.
>
> .TP
> .IR "exclude_host" " (Since Linux 3.2)"
> Do not measure time spent in VM host.
>
> .TP
> .IR "exclude_guest" " (Since Linux 3.2)"
> Do not measure time spent in VM guest.
>
> .TP
> .IR "exclude_callchain_kernel" " (Since Linux 3.7)"
> Do not include kernel callchains.
>
> .TP
> .IR "exclude_callchain_user" " (Since Linux 3.7)"
> Do not include user callchains.
>
> .TP
> .IR "wakeup_events" ", " "wakeup_watermark"
> This union sets how many samples
> .RI ( wakeup_events )
> or bytes
> .RI ( wakeup_watermark )
> happen before an overflow signal happens.
> Which one is used is selected by the
> .I watermark
> bitflag.
>
> .TP
> .IR "bp_type" " (Since Linux 2.6.33)"
> This chooses the breakpoint type.
> It is one of:
> .RS
> .TP
> .BR HW_BREAKPOINT_EMPTY
> no breakpoint
> .TP
> .BR HW_BREAKPOINT_R
> count when we read the memory location
> .TP
> .BR HW_BREAKPOINT_W
> count when we write the memory location
> .TP
> .BR HW_BREAKPOINT_RW
> count when we read or write the memory location
> .TP
> .BR HW_BREAKPOINT_X
> count when we execute code at the memory location
>
> .LP
> The values can be combined via a bitwise or, but the
> combination of
> .B HW_BREAKPOINT_R
> or
> .B HW_BREAKPOINT_W
> with
> .B HW_BREAKPOINT_X
> is not allowed.
> .RE
>
> .TP
> .IR "bp_addr" " (Since Linux 2.6.33)"
> .I bp_addr
> is the address of the breakpoint.
> For execution breakpoints this is the memory address of the instruction
> of interest; for read and write breakpoints it is the memory address
> of the memory location of interest.
>
> .TP
> .IR "config1" " (Since Linux 2.6.39)"
> .I config1
> is used for setting events that need an extra register or otherwise
> do not fit in the regular config field.
> Raw OFFCORE_EVENTS on Nehalem/Westmere/SandyBridge use this field
> on 3.3 and later kernels.
>
> .TP
> .IR "bp_len" " (Since Linux 2.6.33)"
> .I bp_len
> is the length of the breakpoint being measured if
> .I type
> is
> .BR PERF_TYPE_BREAKPOINT .
> Options are
> .BR HW_BREAKPOINT_LEN_1 ,
> .BR HW_BREAKPOINT_LEN_2 ,
> .BR HW_BREAKPOINT_LEN_4 ,
> .BR HW_BREAKPOINT_LEN_8 .
> For an execution breakpoint, set this to
> .IR sizeof(long) .
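
Putting the three breakpoint fields together, a setup routine could
look roughly like this. The name init_breakpoint_attr is hypothetical,
and rejecting HW_BREAKPOINT_EMPTY is my choice for the example, not a
rule stated in the page.

```c
#include <linux/hw_breakpoint.h>
#include <linux/perf_event.h>
#include <string.h>

/* Hypothetical helper. Returns 0 on success, or -1 if bp_type mixes
 * R/W with X (not allowed) or selects nothing at all. */
int init_breakpoint_attr(struct perf_event_attr *attr, __u64 addr,
                         __u32 bp_type, __u64 bp_len)
{
    int rw = bp_type & HW_BREAKPOINT_RW;
    int x = bp_type & HW_BREAKPOINT_X;

    if ((rw && x) || (!rw && !x))
        return -1;

    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_BREAKPOINT;
    attr->size = sizeof(*attr);
    attr->config = 0;                      /* left at zero for breakpoints */
    attr->bp_type = bp_type;
    attr->bp_addr = addr;
    /* Execution breakpoints use sizeof(long), as the text recommends. */
    attr->bp_len = x ? sizeof(long) : bp_len;
    return 0;
}
```
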
>
> .TP
> .IR "config2" " (Since Linux 2.6.39)"
>
> .I config2
> is a further extension of the
> .I config1
> field.
>
> .TP
> .IR "branch_sample_type" " (Since Linux 3.4)"
> This is used with the CPU's hardware branch sampling, if available.
> It can have one of the following values:
> .RS
> .TP
> .B PERF_SAMPLE_BRANCH_USER
> Branch target is in user space
> .TP
> .B PERF_SAMPLE_BRANCH_KERNEL
> Branch target is in kernel space
> .TP
> .B PERF_SAMPLE_BRANCH_HV
> Branch target is in hypervisor
> .TP
> .B PERF_SAMPLE_BRANCH_ANY
> Any branch type.
> .TP
> .B PERF_SAMPLE_BRANCH_ANY_CALL
> Any call branch
> .TP
> .B PERF_SAMPLE_BRANCH_ANY_RETURN
> Any return branch
> .TP
> .BR PERF_SAMPLE_BRANCH_IND_CALL
> Indirect calls
> .TP
> .BR PERF_SAMPLE_BRANCH_PLM_ALL
> User, kernel, and hv
> .RE
>
> .TP
> .IR "sample_regs_user" " (Since Linux 3.7)"
> This defines the set of user registers to dump on samples.
> See asm/perf_regs.h.
>
> .TP
> .IR "sample_stack_user" " (Since Linux 3.7)"
> This defines the size of the user stack to dump on sample.
>
> .RE
>
> .SS "Reading Results"
> Once a
> .BR perf_event_open ()
> file descriptor has been opened, the values
> of the events can be read from the file descriptor.
> The values that are there are specified by the
> .I read_format
> field in the attr structure at open time.
>
> If you attempt to read into a buffer that is not big enough to hold the
> data,
> .B ENOSPC
> is returned.
>
> Here is the layout of the data returned by a read:
>
> If
> .B PERF_FORMAT_GROUP
> was specified to allow reading all events in a group at once:
>
> .in +4n
> .nf
> struct read_format {
> u64 nr; /* The number of events */
> u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
> u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
> struct {
> u64 value; /* The value of the event */
> u64 id; /* if PERF_FORMAT_ID */
> } values[nr];
> };
> .fi
> .in
>
> If
> .B PERF_FORMAT_GROUP
> was
> .I not
> specified, then the read values look as follows:
>
> .in +4n
> .nf
> struct read_format {
> u64 value; /* The value of the event */
> u64 time_enabled; /* if PERF_FORMAT_TOTAL_TIME_ENABLED */
> u64 time_running; /* if PERF_FORMAT_TOTAL_TIME_RUNNING */
> u64 id; /* if PERF_FORMAT_ID */
> };
> .fi
> .in
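
Since the optional fields shift everything after them, decoding the
group layout takes a little care. A sketch of one way to do it; the
struct group_reading type and the function name are hypothetical.

```c
#include <linux/perf_event.h>
#include <stddef.h>
#include <stdint.h>

struct group_reading {           /* hypothetical result type */
    uint64_t nr, time_enabled, time_running;
    const uint64_t *values;      /* start of the values[] records */
    size_t stride;               /* u64 words per record: 1, or 2 with id */
};

/* Decode a buffer filled by read(2) on an event opened with
 * PERF_FORMAT_GROUP in read_format. `words` is the buffer length in
 * u64 units. Returns 0, or -1 if the buffer is too short for the
 * layout that read_format implies. */
int parse_group_read(const uint64_t *buf, size_t words, uint64_t read_format,
                     struct group_reading *out)
{
    size_t pos = 0;

    if (!(read_format & PERF_FORMAT_GROUP) || words < 1)
        return -1;
    out->nr = buf[pos++];
    out->time_enabled = out->time_running = 0;
    if (read_format & PERF_FORMAT_TOTAL_TIME_ENABLED) {
        if (pos >= words)
            return -1;
        out->time_enabled = buf[pos++];
    }
    if (read_format & PERF_FORMAT_TOTAL_TIME_RUNNING) {
        if (pos >= words)
            return -1;
        out->time_running = buf[pos++];
    }
    out->stride = (read_format & PERF_FORMAT_ID) ? 2 : 1;
    if (out->nr > (words - pos) / out->stride)
        return -1;
    out->values = buf + pos;
    return 0;
}
```

The count for member i is then values[i * stride], and its id (when
PERF_FORMAT_ID is set) is values[i * stride + 1].
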
>
> The values read are described in more detail below.
> .RS
> .TP
> .I nr
> The number of events in this file descriptor.
> Only available if
> .B PERF_FORMAT_GROUP
> was specified.
>
> .TP
> .IR time_enabled ", " time_running
> Total time the event was enabled and running.
> Normally these are the same.
> If more events are started
> than available counter slots on the PMU, then multiplexing
> happens and events only run part of the time.
> In that case the
> .I time_enabled
> and
> .I time_running
> values can be used to scale an estimated value for the count.
>
> .TP
> .I value
> An unsigned 64-bit value containing the counter result.
>
> .TP
> .I id
> A globally unique value for this particular event, present only if
> .B PERF_FORMAT_ID
> was specified in read_format.
>
> .RE
> .RE
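
The scaling mentioned under time_enabled and time_running can be
sketched as follows. The function name scale_count is hypothetical,
and returning 0 when the event never ran is a choice made for the
example.

```c
#include <stdint.h>

/* Estimate the full count for a multiplexed event:
 *     value * time_enabled / time_running
 * split into quotient and remainder so that the intermediate product
 * is less likely to overflow. Returns 0 if the event never ran. */
uint64_t scale_count(uint64_t value, uint64_t time_enabled,
                     uint64_t time_running)
{
    uint64_t quot, rem;

    if (time_running == 0)
        return 0;
    quot = value / time_running;
    rem = value % time_running;
    return quot * time_enabled + (rem * time_enabled) / time_running;
}
```

When time_enabled equals time_running (no multiplexing), the value
comes back unchanged.
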
>
>
>
> .SS "MMAP Layout"
>
> When using
> .BR perf_event_open ()
> in sampled mode, asynchronous events
> (like counter overflow or
> .B PROT_EXEC
> mmap tracking)
> are logged into a ring-buffer.
> This ring-buffer is created and accessed through
> .BR mmap (2).
>
> The mmap size should be 1+2^n pages, where the first page is a
> metadata page
> .IR ( "struct perf_event_mmap_page" )
> that contains various
> bits of information such as where the ring-buffer head is.
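
The 1+2^n sizing rule, written out as a sketch (ring_buffer_mmap_len
is a hypothetical name):

```c
#include <stddef.h>

/* Bytes to request from mmap(2): one metadata page plus 2^n data
 * pages, for a page size of `page_size` bytes. */
size_t ring_buffer_mmap_len(unsigned int n, size_t page_size)
{
    return ((size_t)1 + ((size_t)1 << n)) * page_size;
}
```

In practice page_size would come from sysconf(_SC_PAGESIZE), and the
result would be passed as the length argument of mmap(2) on the event
file descriptor.
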
>
> Before kernel 2.6.39, a bug means that you must allocate an mmap
> ring buffer when sampling even if you do not plan to access it.
>
> The structure of the first metadata mmap page is as follows:
>
> .in +4n
> .nf
> struct perf_event_mmap_page {
> __u32 version; /* version number of this structure */
> __u32 compat_version; /* lowest version this is compat with */
> __u32 lock; /* seqlock for synchronization */
> __u32 index; /* hardware counter identifier */
> __s64 offset; /* add to hardware counter value */
> __u64 time_enabled; /* time event active */
> __u64 time_running; /* time event on CPU */
> union {
> __u64 capabilities;
> __u64 cap_usr_time : 1,
> cap_usr_rdpmc : 1;
> };
> __u16 pmc_width;
> __u16 time_shift;
> __u32 time_mult;
> __u64 time_offset;
> __u64 __reserved[120]; /* Pad to 1k */
> __u64 data_head; /* head in the data section */
> __u64 data_tail; /* user-space written tail */
> };
> .fi
> .in
>
>
>
> The following looks at the fields in the
> .I perf_event_mmap_page
> structure in more detail.
>
> .RS
>
> .TP
> .I version
> Version number of this structure.
>
> .TP
> .I compat_version
> The lowest version this is compatible with.
>
> .TP
> .I lock
> A seqlock for synchronization.
>
> .TP
> .I index
> A unique hardware counter identifier.
>
> .TP
> .I offset
> .\" FIXME clarify
> Add this to hardware counter value??
>
> .TP
> .I time_enabled
> Time the event was active.
>
> .TP
> .I time_running
> Time the event was running.
>
> .TP
> .I cap_usr_time
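> If set, the
> .IR time_shift ", " time_mult ", and " time_offset
> fields can be used to compute time deltas in user space,
> as described below.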
>
> .TP
> .I cap_usr_rdpmc
> If the hardware supports user-space read of performance counters
> without syscall (this is the "rdpmc" instruction on x86), then
> the following code can be used to do a read:
>
> .in +4n
> .nf
> u32 seq, time_mult, time_shift, idx, width;
> u64 count, enabled, running;
> u64 cyc, time_offset;
> s64 pmc = 0;
>
> do {
>     seq = pc\->lock;
>     barrier();
>     enabled = pc\->time_enabled;
>     running = pc\->time_running;
>
>     if (pc\->cap_usr_time && enabled != running) {
>         cyc = rdtsc();
>         time_offset = pc\->time_offset;
>         time_mult   = pc\->time_mult;
>         time_shift  = pc\->time_shift;
>     }
>
>     idx = pc\->index;
>     count = pc\->offset;
>
>     if (pc\->cap_usr_rdpmc && idx) {
>         width = pc\->pmc_width;
>         pmc = rdpmc(idx \- 1);
>     }
>
>     barrier();
> } while (pc\->lock != seq);
> .fi
> .in
>
>
>
> .TP
> .I pmc_width
> If
> .IR cap_usr_rdpmc ,
> this field provides the bit-width of the value
> read using the rdpmc or equivalent instruction.
> This can be used to sign extend the result like:
>
> .in +4n
> .nf
> pmc <<= 64 \- pmc_width;
> pmc >>= 64 \- pmc_width; // signed shift right
> count += pmc;
> .fi
> .in
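
One caution on the snippet above: left-shifting a negative signed value is undefined in C, so a portable version probably wants to do the left shift on an unsigned value. A minimal sketch follows, assuming that right-shifting a negative int64_t is an arithmetic shift (as it is with GCC and Clang). The name sign_extend_pmc() is hypothetical.

```c
#include <stdint.h>

/*
 * Sign-extend a raw counter value that is `width` bits wide.
 * The left shift is done on the unsigned value; the right shift
 * of the signed result is assumed to be arithmetic.
 */
int64_t sign_extend_pmc(uint64_t raw, unsigned int width)
{
    unsigned int shift;

    if (width == 0 || width >= 64)
        return (int64_t)raw;    /* nothing to extend */

    shift = 64 - width;
    return (int64_t)(raw << shift) >> shift;
}
```

The result would then be added to the offset read from the mmap page, as in "count += pmc" above.
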
>
>
> .TP
> .IR time_shift ", " time_mult ", " time_offset
>
> If
> .IR cap_usr_time ,
> these fields can be used to compute the time
> delta since time_enabled (in ns) using rdtsc or similar.
> .nf
>
> u64 quot, rem;
> u64 delta;
> quot = (cyc >> time_shift);
> rem = cyc & (((u64) 1 << time_shift) \- 1);
> delta = time_offset + quot * time_mult +
>         ((rem * time_mult) >> time_shift);
> .fi
>
> Where time_offset, time_mult, time_shift, and cyc are read in the
> seqcount loop described above.
> This delta can then be added to
> enabled and possibly running (if idx), improving the scaling:
> .nf
>
> enabled += delta;
> if (idx)
> running += delta;
> quot = count / running;
> rem = count % running;
> count = quot * enabled + (rem * enabled) / running;
> .fi
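
The two calculations above can be sketched as plain functions, which makes them easy to check in isolation. The function names are hypothetical, time_shift is assumed to be less than 64, and returning 0 when running is 0 is my own choice to avoid a division by zero (the text does not say what to do in that case).

```c
#include <stdint.h>

/* Convert a cycle count to a time delta using the mmap page fields. */
uint64_t cycles_to_delta(uint64_t cyc, uint16_t time_shift,
                         uint32_t time_mult, uint64_t time_offset)
{
    uint64_t quot = cyc >> time_shift;
    uint64_t rem = cyc & (((uint64_t)1 << time_shift) - 1);

    return time_offset + quot * time_mult +
           ((rem * time_mult) >> time_shift);
}

/*
 * Scale a raw count by enabled/running.  Splitting into quotient
 * and remainder keeps the intermediate products smaller than
 * computing count * enabled directly.
 */
uint64_t scale_count(uint64_t count, uint64_t enabled, uint64_t running)
{
    uint64_t quot, rem;

    if (running == 0)
        return 0;   /* assumption: the event never ran, so no estimate */

    quot = count / running;
    rem = count % running;
    return quot * enabled + (rem * enabled) / running;
}
```

With time_shift = 10 and time_mult = 1024 the conversion is the identity on cyc, which is a convenient sanity check.
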
>
> .TP
> .I data_head
> This points to the head of the data section.
> The value continuously increases; it does not wrap.
> The value needs to be manually wrapped by the size of the mmap buffer
> before accessing the samples.
>
> On SMP-capable platforms, after reading the data_head value,
> user-space should issue an rmb().
>
> .TP
> .I data_tail
> When the mapping is
> .BR PROT_WRITE ,
> the data_tail value should be written by
> userspace to reflect the last read data.
> In this case the kernel will not over-write unread data.
>
> .RE
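
To make the "manually wrapped" remark about data_head concrete, here is a minimal sketch of copying bytes out of the data area when the requested span may run past the end of the buffer. It assumes the data area size is a power of two (which follows from the 2^n pages rule) and that len does not exceed that size. ring_copy() is a hypothetical helper. A real reader would also need the read barrier after loading data_head and would write data_tail once the data is consumed; neither is shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy len bytes starting at the free-running position pos out of
 * a ring of data_size bytes.  data_size must be a power of two and
 * len must not exceed data_size.
 */
void ring_copy(void *dst, const uint8_t *data, uint64_t data_size,
               uint64_t pos, size_t len)
{
    uint64_t off = pos & (data_size - 1);       /* wrap the position */
    size_t first = (size_t)(data_size - off);   /* bytes before the end */

    if (first > len)
        first = len;

    memcpy(dst, data + off, first);
    memcpy((uint8_t *)dst + first, data, len - first);
}
```
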
>
>
> The following 2^n ring-buffer pages have the layout described below.
>
> If
> .I perf_event_attr.sample_id_all
> is set, then all event types will include the sample_type selected
> fields that identify where and when an event took place
> (TID, TIME, ID, CPU, STREAM_ID), as described in
> .B PERF_RECORD_SAMPLE
> below.
> These fields are placed after the perf_event_header and the fields
> already present for the record type, that is, at the end of the payload.
> That way a newer perf.data
> file will be supported by older perf tools, with these new optional
> fields being ignored.
>
> The mmap values start with a header:
>
> .in +4n
> .nf
> struct perf_event_header {
>     __u32   type;
>     __u16   misc;
>     __u16   size;
> };
> .fi
> .in
>
> Below, we describe the
> .I perf_event_header
> fields in more detail.
>
> .TP
> .I type
> The
> .I type
> value is one of the below.
> The values in the corresponding record (that follows the header)
> depend on the
> .I type
> selected as shown.
>
> .RS
> .TP
> .B PERF_RECORD_MMAP
> The MMAP events record the
> .B PROT_EXEC
> mappings so that we can correlate
> userspace IPs to code.
> They have the following structure:
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u32    pid, tid;
>     u64    addr;
>     u64    len;
>     u64    pgoff;
>     char   filename[];
> };
> .fi
> .in
>
> .TP
> .B PERF_RECORD_LOST
> This record indicates when events are lost.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u64    id;
>     u64    lost;
> };
> .fi
> .in
>
> .RS
> .TP
> .I id
> is the unique event ID for the samples that were lost.
> .TP
> .I lost
> is the number of events that were lost.
> .RE
>
> .TP
> .B PERF_RECORD_COMM
> This record indicates a change in the process name.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u32    pid, tid;
>     char   comm[];
> };
> .fi
> .in
>
> .TP
> .B PERF_RECORD_EXIT
> This record indicates a process exit event.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u32    pid, ppid;
>     u32    tid, ptid;
>     u64    time;
> };
> .fi
> .in
>
> .TP
> .BR PERF_RECORD_THROTTLE ", " PERF_RECORD_UNTHROTTLE
> This record indicates a throttle/unthrottle event.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u64    time;
>     u64    id;
>     u64    stream_id;
> };
> .fi
> .in
>
> .TP
> .B PERF_RECORD_FORK
> This record indicates a fork event.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u32    pid, ppid;
>     u32    tid, ptid;
>     u64    time;
> };
> .fi
> .in
>
> .TP
> .B PERF_RECORD_READ
> This record indicates a read event.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u32    pid, tid;
>     struct read_format values;
> };
> .fi
> .in
>
> .TP
> .B PERF_RECORD_SAMPLE
> This record indicates a sample.
>
> .in +4n
> .nf
> struct {
>     struct perf_event_header header;
>     u64   ip;                 /* if PERF_SAMPLE_IP */
>     u32   pid, tid;           /* if PERF_SAMPLE_TID */
>     u64   time;               /* if PERF_SAMPLE_TIME */
>     u64   addr;               /* if PERF_SAMPLE_ADDR */
>     u64   id;                 /* if PERF_SAMPLE_ID */
>     u64   stream_id;          /* if PERF_SAMPLE_STREAM_ID */
>     u32   cpu, res;           /* if PERF_SAMPLE_CPU */
>     u64   period;             /* if PERF_SAMPLE_PERIOD */
>     struct read_format v;     /* if PERF_SAMPLE_READ */
>     u64   nr;                 /* if PERF_SAMPLE_CALLCHAIN */
>     u64   ips[nr];            /* if PERF_SAMPLE_CALLCHAIN */
>     u32   size;               /* if PERF_SAMPLE_RAW */
>     char  data[size];         /* if PERF_SAMPLE_RAW */
>     u64   bnr;                /* if PERF_SAMPLE_BRANCH_STACK */
>     struct perf_branch_entry
>           lbr[bnr];           /* if PERF_SAMPLE_BRANCH_STACK */
>     u64   abi;                /* if PERF_SAMPLE_REGS_USER */
>     u64   regs[weight(mask)]; /* if PERF_SAMPLE_REGS_USER */
>     u64   size;               /* if PERF_SAMPLE_STACK_USER */
>     char  data[size];         /* if PERF_SAMPLE_STACK_USER */
>     u64   dyn_size;           /* if PERF_SAMPLE_STACK_USER */
> };
> .fi
>
> .RS
> .TP
> .I ip
> If PERF_SAMPLE_IP is enabled then a 64-bit instruction
> pointer value is included.
>
> .TP
> .IR pid ", " tid
> If PERF_SAMPLE_TID is enabled then a 32-bit process id
> and 32-bit thread id are included.
>
> .TP
> .I time
> If PERF_SAMPLE_TIME is enabled then a 64-bit timestamp
> is included.
> This is obtained via local_clock() which is a hardware timestamp
> if available and the jiffies value if not.
>
> .TP
> .I addr
> If PERF_SAMPLE_ADDR is enabled then a 64-bit address is included.
> This is usually the address of a tracepoint,
> breakpoint, or software event; otherwise the value is 0.
>
> .TP
> .I id
> If PERF_SAMPLE_ID is enabled a 64-bit unique ID is included.
> If the event is a member of an event group, the group leader ID is returned.
> This ID is the same as the one returned by PERF_FORMAT_ID.
>
> .TP
> .I stream_id
> If PERF_SAMPLE_STREAM_ID is enabled a 64-bit unique ID is included.
> Unlike
> .B PERF_SAMPLE_ID
> the actual ID is returned, not the group leader.
> This ID is the same as the one returned by PERF_FORMAT_ID.
>
> .TP
> .IR cpu ", " res
> If PERF_SAMPLE_CPU is enabled this is a 32-bit value indicating
> which CPU was being used, in addition to a reserved (unused)
> 32-bit value.
>
> .TP
> .I period
> If PERF_SAMPLE_PERIOD is enabled a 64-bit value indicating
> the current sampling period is written.
>
> .TP
> .I v
> If PERF_SAMPLE_READ is enabled a structure of type read_format
> is included which has values for all events in the event group.
> The values included depend on the
> .I read_format
> value used at perf_event_open() time.
>
> .TP
> .IR nr ", " ips[nr]
> If PERF_SAMPLE_CALLCHAIN is enabled then a 64-bit number is included
> which indicates how many 64-bit instruction pointers follow.
> This is the current callchain.
>
> .TP
> .IR size ", " data
> If PERF_SAMPLE_RAW is enabled then a 32-bit value indicating size
> is included followed by an array of 8-bit values of length size.
> The values are padded with 0 to have 64-bit alignment.
>
> This RAW record data is opaque with respect to the ABI.
> The ABI doesn't make any promises with respect to the stability
> of its content; it may vary depending
> on event, hardware, and kernel version.
>
> .TP
> .IR bnr ", " lbr[bnr]
> If PERF_SAMPLE_BRANCH_STACK is enabled then a 64-bit value indicating
> the number of records is included, followed by bnr perf_branch_entry
> structures. These structures have from, to, and flags values indicating
> the from and to addresses from the branches on the callstack.
>
> .TP
> .IR abi ", " regs[weight(mask)]
> If PERF_SAMPLE_REGS_USER is enabled then
> [to be documented].
>
> The
> .I abi
> field is one of
> .BR PERF_SAMPLE_REGS_ABI_NONE ", " PERF_SAMPLE_REGS_ABI_32 " or "
> .BR PERF_SAMPLE_REGS_ABI_64 ". "
>
> .TP
> .IR size ", " data[size] ", " dyn_size
> If PERF_SAMPLE_STACK_USER is enabled then
> [to be documented].
>
> .RE
>
> .RE
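
Since the layout of a PERF_RECORD_SAMPLE body depends on which sample_type bits were requested, a parser has to walk the fields in the order given in the struct above. Here is a minimal sketch covering only the fixed-size fields up to period. The struct sample_fixed and the function parse_sample_fixed() are hypothetical names, and the variable-length parts (read values, callchain, raw data, branch stack, and so on) are deliberately not handled.

```c
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Hypothetical container for the fixed-size sample fields. */
struct sample_fixed {
    uint64_t ip;
    uint32_t pid, tid;
    uint64_t time, addr, id, stream_id;
    uint32_t cpu;
    uint64_t period;
};

/*
 * Decode the fixed-size fields that follow the record header.
 * p points just past the perf_event_header.  Returns a pointer to
 * the first byte after the decoded fields.
 */
const uint8_t *parse_sample_fixed(const uint8_t *p, uint64_t sample_type,
                                  struct sample_fixed *out)
{
    memset(out, 0, sizeof(*out));

    if (sample_type & PERF_SAMPLE_IP) { memcpy(&out->ip, p, 8); p += 8; }
    if (sample_type & PERF_SAMPLE_TID) {
        memcpy(&out->pid, p, 4);
        memcpy(&out->tid, p + 4, 4);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_TIME) { memcpy(&out->time, p, 8); p += 8; }
    if (sample_type & PERF_SAMPLE_ADDR) { memcpy(&out->addr, p, 8); p += 8; }
    if (sample_type & PERF_SAMPLE_ID) { memcpy(&out->id, p, 8); p += 8; }
    if (sample_type & PERF_SAMPLE_STREAM_ID) {
        memcpy(&out->stream_id, p, 8);
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_CPU) {
        memcpy(&out->cpu, p, 4);    /* the reserved 32 bits are skipped */
        p += 8;
    }
    if (sample_type & PERF_SAMPLE_PERIOD) { memcpy(&out->period, p, 8); p += 8; }

    return p;
}
```

The sample_type passed in is assumed to be the same value that was placed in perf_event_attr when the event was opened; nothing in the record itself says which fields are present.
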
>
>
> .TP
> .I misc
> The
> .I misc
> field contains additional information about the sample.
>
> The CPU mode can be determined from this value by masking with
> .B PERF_RECORD_MISC_CPUMODE_MASK
> and looking for one of the following (note these are not
> bitmasks, only one can be set at a time):
> .RS
> .TP
> .B PERF_RECORD_MISC_CPUMODE_UNKNOWN
> Unknown CPU mode.
> .TP
> .B PERF_RECORD_MISC_KERNEL
> Sample happened in the kernel.
> .TP
> .B PERF_RECORD_MISC_USER
> Sample happened in user code.
> .TP
> .B PERF_RECORD_MISC_HYPERVISOR
> Sample happened in the hypervisor.
> .TP
> .B PERF_RECORD_MISC_GUEST_KERNEL
> Sample happened in the guest kernel.
> .TP
> .B PERF_RECORD_MISC_GUEST_USER
> Sample happened in guest user code.
> .RE
>
> In addition one of the following bits can be set:
> .RS
> .TP
> .B PERF_RECORD_MISC_EXACT_IP
> This indicates that the content of
> .B PERF_SAMPLE_IP
> points
> to the actual instruction that triggered the event.
> See also
> .IR perf_event_attr.precise_ip .
>
> .TP
> .B PERF_RECORD_MISC_EXT_RESERVED
> This indicates there is extended data available (currently not used).
>
> .RE
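
Because the CPU mode values are an enumeration under a mask rather than independent bits, they are best decoded with a switch on the masked value, while PERF_RECORD_MISC_EXACT_IP is tested as an ordinary bit. A small sketch; the function names and the returned strings are my own.

```c
#include <stdint.h>
#include <linux/perf_event.h>

/* Map the CPU mode encoded in the header's misc field to a label. */
const char *cpumode_name(uint16_t misc)
{
    switch (misc & PERF_RECORD_MISC_CPUMODE_MASK) {
    case PERF_RECORD_MISC_KERNEL:       return "kernel";
    case PERF_RECORD_MISC_USER:         return "user";
    case PERF_RECORD_MISC_HYPERVISOR:   return "hypervisor";
    case PERF_RECORD_MISC_GUEST_KERNEL: return "guest-kernel";
    case PERF_RECORD_MISC_GUEST_USER:   return "guest-user";
    default:                            return "unknown";
    }
}

/* Nonzero if the sampled IP is the instruction that triggered the event. */
int sample_has_exact_ip(uint16_t misc)
{
    return (misc & PERF_RECORD_MISC_EXACT_IP) != 0;
}
```
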
>
> .TP
> .I size
> This indicates the size of the record.
>
> .RE
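
The size field is what lets a reader step from one record to the next without understanding every record type. A minimal sketch of such a walk over a linear (already unwrapped) buffer follows. It assumes that size counts the header itself as well as the payload. struct rec_header is a local mirror of perf_event_header so that the example stands alone, and count_records() is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct perf_event_header, for illustration only. */
struct rec_header {
    uint32_t type;
    uint16_t misc;
    uint16_t size;
};

/* Count the records of a given type in a linear buffer of len bytes. */
size_t count_records(const uint8_t *buf, size_t len, uint32_t type)
{
    size_t off = 0, n = 0;

    while (len - off >= sizeof(struct rec_header)) {
        struct rec_header h;

        memcpy(&h, buf + off, sizeof(h));
        if (h.size < sizeof(h) || h.size > len - off)
            break;              /* truncated or corrupt record */
        if (h.type == type)
            n++;
        off += h.size;          /* step to the next record */
    }
    return n;
}
```
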
>
> .SS "Signal Overflow"
>
> Events can be set to deliver a signal when a threshold is crossed.
> The signal handler is set up using the
> .BR poll (2),
> .BR select (2),
> .BR epoll (7)
> and
> .BR fcntl (2)
> system calls.
>
> To generate signals, sampling must be enabled
> .RI ( sample_period
> must have a non-zero value).
>
> There are two ways to generate signals.
>
> The first is to set a
> .I wakeup_events
> or
> .I wakeup_watermark
> value that will generate a signal if a certain number of samples
> or bytes have been written to the mmap ring buffer.
> In this case a signal of type POLL_IN is sent.
>
> The other way is by use of the
> .B PERF_EVENT_IOC_REFRESH
> ioctl.
> This ioctl adds to a counter that decrements each time the event overflows.
> When non-zero, a POLL_IN signal is sent on overflow, but
> once the value reaches 0, a signal is sent of type POLL_HUP and
> the underlying event is disabled.
>
> Note: on newer kernels (definitely noticed with 3.2)
> .\" FIXME : Find out when this was introduced
> a signal is provided for every overflow, even if
> .I wakeup_events
> is not set.
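
It might help readers to see the fcntl(2) side of this spelled out. The following is a minimal sketch of asking for a signal on an event file descriptor, using only standard fcntl(2) operations. setup_overflow_signal() is a hypothetical name, and the caller is assumed to have installed a handler for signo beforehand, with SA_SIGINFO if it wants to look at si_code for POLL_IN or POLL_HUP.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE     /* for F_SETSIG */
#endif
#include <fcntl.h>
#include <signal.h>
#include <unistd.h>

/*
 * Request asynchronous notification on fd: enable O_ASYNC, choose
 * the signal to deliver, and direct it to the calling process.
 * Returns 0 on success, -1 on error with errno set by fcntl().
 */
int setup_overflow_signal(int fd, int signo)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    if (fcntl(fd, F_SETFL, flags | O_ASYNC) == -1)
        return -1;
    if (fcntl(fd, F_SETSIG, signo) == -1)
        return -1;
    if (fcntl(fd, F_SETOWN, getpid()) == -1)
        return -1;
    return 0;
}
```

In a threaded program F_SETOWN_EX would be the variant to use instead of F_SETOWN, per the BUGS section.
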
>
> .SS "rdpmc instruction"
> Starting with Linux 3.4 on x86, you can use the
> .I rdpmc
> instruction to get low-latency reads without having to enter the kernel.
> Note that using
> .I rdpmc
> is not necessarily faster than other methods for reading event values.
>
> Support for this can be detected with the
> .I cap_usr_rdpmc
> field in the mmap page; documentation on how
> to calculate event values can be found in that section.
>
> .SS "perf_event ioctl calls"
> .PP
> Various ioctls act on
> .BR perf_event_open ()
> file descriptors.
>
> .TP
> .B PERF_EVENT_IOC_ENABLE
> Enables the individual event or event group specified by the fd.
>
> The ioctl argument is ignored.
>
> .TP
> .B PERF_EVENT_IOC_DISABLE
> Disables the individual counter or event group specified by the fd.
>
> Enabling or disabling the leader of a group enables or disables the
> entire group; that is, while the group leader is disabled, none of the
> counters in the group will count.
> Enabling or disabling a member of a group other than the leader only
> affects that counter; disabling a non-leader
> stops that counter from counting but doesn't affect any other counter.
>
> The ioctl argument is ignored.
>
> .TP
> .B PERF_EVENT_IOC_REFRESH
> Non-inherited overflow counters can use this
> to enable a counter for a number of overflows specified by the argument,
> after which it is disabled.
> Subsequent calls of this ioctl add the argument value to the current
> count.
> A signal with POLL_IN set will happen on each overflow until the
> count reaches 0; when that happens a signal with POLL_HUP set is
> sent and the event is disabled.
> Using an argument of 0 is considered undefined behavior.
>
> .TP
> .B PERF_EVENT_IOC_RESET
> Reset the event count specified by the fd to zero.
> This only resets the counts; there is no way to reset the
> multiplexing
> .I time_enabled
> or
> .I time_running
> values.
> When sent to a group leader, only
> the leader is reset (child events are not).
>
> The ioctl argument is ignored.
>
> .TP
> .B PERF_EVENT_IOC_PERIOD
> This updates the overflow period for the event.
> The change does not take effect immediately; it is deferred
> until the next overflow.
>
> The argument is a pointer to a 64-bit value containing the
> desired new period.
>
> .TP
> .B PERF_EVENT_IOC_SET_OUTPUT
> This tells the kernel to report event notifications to the specified
> file descriptor rather than the default one.
> The file descriptors must all be on the same CPU.
>
> The argument specifies the desired file descriptor, or \-1 if
> output should be ignored.
>
> .TP
> .BR PERF_EVENT_IOC_SET_FILTER " (Since Linux 2.6.33)"
> This adds an ftrace filter to this event.
>
> The argument is a pointer to the desired ftrace filter.
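
Given that an argument of 0 to PERF_EVENT_IOC_REFRESH is described above as undefined, callers may want a thin wrapper that refuses it. A sketch under that assumption; arm_for_overflows() is a hypothetical name, and rejecting non-positive counts with EINVAL is my choice, not kernel behavior.

```c
#include <errno.h>
#include <sys/ioctl.h>
#include <linux/perf_event.h>

/*
 * Enable the event on fd for n more overflows.  A count of zero or
 * less is rejected here rather than being passed to the kernel.
 * Returns the ioctl() result, or -1 with errno set to EINVAL.
 */
int arm_for_overflows(int fd, int n)
{
    if (n <= 0) {
        errno = EINVAL;
        return -1;
    }
    return ioctl(fd, PERF_EVENT_IOC_REFRESH, n);
}
```
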
>
> .SS "Using prctl"
> A process can enable or disable all the event groups that are
> attached to it using the
> .BR prctl (2)
> .B PR_TASK_PERF_EVENTS_ENABLE
> and
> .B PR_TASK_PERF_EVENTS_DISABLE
> operations.
> This applies to all counters on the current process, whether created by
> this process or by another, and does not affect any counters that this
> process has created on other processes.
> It only enables or disables
> the group leaders, not any other members in the groups.
>
> .SS perf_event related configuration files
>
> Files in /proc/sys/kernel/
>
> .RS
> .TP
> .I
> /proc/sys/kernel/perf_event_paranoid
>
> The
> .I perf_event_paranoid
> file can be set to restrict access to the performance counters.
>
> 2 - only allow userspace measurements
>
> 1 - (default) allow both kernel and user measurements
>
> 0 - allow access to CPU-specific data but not raw tracepoint samples
>
> \-1 - no restrictions
>
> The existence of the
> .I perf_event_paranoid
> file is the official method for determining if a kernel supports
> .BR perf_event_open ().
>
> .TP
> .I /proc/sys/kernel/perf_event_max_sample_rate
>
> This sets the maximum sample rate. Setting this too high can allow
> users to sample at a rate that impacts overall machine performance
> and potentially lock up the machine. The default value is
> 100000 (samples per second).
>
> .TP
> .I /proc/sys/kernel/perf_event_mlock_kb
>
> Maximum number of pages an unprivileged user can
> .BR mlock (2).
> The default is 516 (kB).
> .RE
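
The existence check on perf_event_paranoid is simple enough to show inline. A minimal sketch; both function names are hypothetical, and the path is taken as a parameter in the helper only so that the check can be exercised against other paths.

```c
#include <unistd.h>

/* Return 1 if path exists, 0 otherwise. */
int path_exists(const char *path)
{
    return access(path, F_OK) == 0;
}

/* Return 1 if the kernel appears to support perf_event_open(). */
int perf_event_supported(void)
{
    return path_exists("/proc/sys/kernel/perf_event_paranoid");
}
```
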
>
> Files in /sys/bus/event_source/devices/
>
> Since Linux 2.6.34 the kernel supports having multiple PMUs
> available for monitoring.
> Information on how to program these PMUs can be found under
> .IR /sys/bus/event_source/devices/ .
> Each subdirectory corresponds to a different PMU.
>
> .RS
> .TP
> .I /sys/bus/event_source/devices/*/type
> This contains an integer that can be used in the
> .I type
> field of perf_event_attr to indicate you wish to use this PMU.
>
> .TP
> .I /sys/bus/event_source/devices/*/rdpmc
> [To be documented]
>
> .TP
> .I /sys/bus/event_source/devices/*/format/
> This sub-directory contains information on what bits in the
> .I config
> field of perf_event_attr correspond to.
>
> .TP
> .I /sys/bus/event_source/devices/*/events/
> This sub-directory contains files with pre-defined events.
> The contents are strings describing the event settings
> expressed in terms of the fields found in the
> .I ./format/
> directory.
> These are not necessarily complete lists of all events supported by
> a PMU, but usually a subset of events deemed useful or interesting.
>
> .TP
> .I /sys/bus/event_source/devices/*/uevent
> [To be documented]
>
> .RE
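
For the type file, reading the integer is all that is needed before storing it in perf_event_attr.type. A sketch follows; read_pmu_type() is a hypothetical name, and the caller supplies the full path, since which PMU subdirectories exist depends on the machine.

```c
#include <stdio.h>

/*
 * Read the decimal PMU type from a sysfs "type" file.
 * Returns 0 and stores the value in *type on success, -1 on failure.
 */
int read_pmu_type(const char *path, unsigned int *type)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%u", type) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```
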
>
>
> .SH "RETURN VALUE"
> .BR perf_event_open ()
> returns the new file descriptor, or \-1 if an error occurred
> (in which case,
> .I errno
> is set appropriately).
> .SH ERRORS
> .TP
> .B EINVAL
> Returned if the specified event is not available.
> .TP
> .B ENOSPC
> Prior to Linux 3.3, if there was not enough room for the event,
> .B ENOSPC
> was returned.
> Linus did not like this, and this was changed to
> .BR EINVAL .
> .B ENOSPC
> is still returned if you try to read results into
> too small of a buffer.
>
> .SH VERSION
>
> .BR perf_event_open ()
> was introduced in Linux 2.6.31 but was called
> .BR perf_counter_open ().
> It was renamed in Linux 2.6.32.
>
> .SH CONFORMING TO
>
> This call is specific to Linux
> and should not be used in programs intended to be portable.
>
> .SH NOTES
> Glibc does not provide a wrapper for this system call; call it using
> .BR syscall (2).
>
> The official way of knowing if
> .BR perf_event_open ()
> support is enabled is checking
> for the existence of the file
> .IR /proc/sys/kernel/perf_event_paranoid .
>
> .SH BUGS
>
> The
> .B F_SETOWN_EX
> option to
> .BR fcntl (2)
> is needed to properly get overflow signals in threads.
> This was introduced in Linux 2.6.32.
>
> Prior to Linux 2.6.33 (at least for x86) the kernel did not check
> if events could be scheduled together until read time.
> The same happens on all known kernels if the NMI watchdog is enabled.
> This means to see if a given set of events works you have to
> .BR perf_event_open (),
> start, then read before you know for sure you
> can get valid measurements.
>
> Prior to Linux 2.6.34 event constraints were not enforced by the kernel.
> In that case, some events would silently return "0" if the kernel
> scheduled them in an improper counter slot.
>
> Prior to Linux 2.6.34 there was a bug when multiplexing where the
> wrong results could be returned.
>
> Kernels from Linux 2.6.35 to Linux 2.6.39 can quickly crash the kernel if
> "inherit" is enabled and many threads are started.
>
> Prior to Linux 2.6.35,
> .B PERF_FORMAT_GROUP
> did not work with attached processes.
>
> In older Linux 2.6 versions,
> refreshing an event group leader refreshed all siblings,
> and refreshing with a parameter of 0 enabled infinite refresh.
> This behavior is unsupported and should not be relied on.
>
> There is a bug in the kernel code between
> Linux 2.6.36 and Linux 3.0 that ignores the
> "watermark" field and acts as if wakeup_events
> was chosen if the union has a
> non-zero value in it.
>
> Always double-check your results! Various generalized events
> have had wrong values.
> For example, retired branches measured
> the wrong thing on AMD machines until Linux 2.6.35.
>
> .SH EXAMPLE
> The following is a short example that measures the total
> instruction count of a call to printf().
> .nf
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/ioctl.h>
> #include <linux/perf_event.h>
> #include <asm/unistd.h>
>
> /* glibc provides no wrapper, so invoke the system call directly */
> static long
> perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
>                 int cpu, int group_fd, unsigned long flags)
> {
>     long ret;
>
>     ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
>                   group_fd, flags);
>     return ret;
> }
>
> int
> main(int argc, char **argv)
> {
>     struct perf_event_attr pe;
>     long long count;
>     int fd;
>
>     /* count user-space instructions only; start disabled */
>     memset(&pe, 0, sizeof(struct perf_event_attr));
>     pe.type = PERF_TYPE_HARDWARE;
>     pe.size = sizeof(struct perf_event_attr);
>     pe.config = PERF_COUNT_HW_INSTRUCTIONS;
>     pe.disabled = 1;
>     pe.exclude_kernel = 1;
>     pe.exclude_hv = 1;
>
>     /* measure the calling process on any CPU, no group */
>     fd = perf_event_open(&pe, 0, \-1, \-1, 0);
>     if (fd == \-1) {
>         fprintf(stderr, "Error opening leader %llx\\n", pe.config);
>         exit(EXIT_FAILURE);
>     }
>
>     ioctl(fd, PERF_EVENT_IOC_RESET, 0);
>     ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
>
>     printf("Measuring instruction count for this printf\\n");
>
>     ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
>     read(fd, &count, sizeof(long long));
>
>     printf("Used %lld instructions\\n", count);
>
>     close(fd);
>     exit(EXIT_SUCCESS);
> }
> .fi
>
> .SH "SEE ALSO"
> .BR fcntl (2),
> .BR mmap (2),
> .BR open (2),
> .BR prctl (2),
> .BR read (2)



--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Author of "The Linux Programming Interface"; http://man7.org/tlpi/