Re: [PATCH v6 00/18] perf: add support for sampling taken branches

From: Stephane Eranian
Date: Mon Feb 27 2012 - 03:45:34 EST


On Mon, Feb 27, 2012 at 8:50 AM, Anshuman Khandual
<khandual@xxxxxxxxxxxxxxxxxx> wrote:
> On Friday 10 February 2012 03:50 AM, Stephane Eranian wrote:
>> This patchset adds an important and useful new feature to
>> perf_events: branch stack sampling. In other words, the
>> ability to capture taken branches into each sample.
>>
>> Statistical sampling of taken branches should not be confused
>> with branch tracing: not all branches are necessarily captured.
>>
>> Sampling taken branches is important for basic block profiling,
>> statistical call graphs, and function call counts. Many of these
>> measurements can help drive a compiler optimizer.
>>
>> The branch stack is a software abstraction which sits on top
>> of the PMU hardware. As such, it is not available on all
>> processors. For now, the patch provides the generic interface
>> and the Intel X86 implementation where it leverages the Last
>> Branch Record (LBR) feature (from Core2 to SandyBridge).
>>
>> Branch stack sampling is supported for both per-thread and
>> system-wide modes.
>>
>> It is possible to filter the type and privilege level of branches
>> to sample. The target of the branch is used to determine
>> the privilege level.
>>
>> For each branch, the source and destination are captured. On
>> some hardware platforms, it may be possible to also extract
>> the target prediction and, in that case, it is also exposed
>> to end users.
>>
>> The branch stack can record a variable number of taken
>> branches per sample. Those branches are always consecutive
>> in time. The number of branches captured depends on the
>> filtering and the underlying hardware. On Intel Nehalem
>> and later, up to 16 consecutive branches can be captured
>> per sample.
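[Editor's note: the per-branch record described above (source, target, optional prediction) maps onto a fixed-size entry. A minimal sketch of that layout, modeled on the struct perf_branch_entry / perf_branch_stack this patchset introduces; field names and the 16-entry bound follow the cover letter, but treat the exact packing as illustrative rather than the authoritative ABI:]

```c
#include <stdint.h>

/* Sketch of one captured branch: source and target addresses plus
 * prediction bits packed into a single 64-bit word. */
struct branch_entry {
	uint64_t from;          /* address of the branch instruction */
	uint64_t to;            /* branch target address */
	uint64_t mispred : 1;   /* set if HW reports a misprediction */
	uint64_t predicted : 1; /* set if HW reports a correct prediction */
	uint64_t reserved : 62;
};

/* A sample carries a counted run of consecutive-in-time entries;
 * on Intel Nehalem and later, up to 16 of them per sample. */
#define MAX_BRANCH_ENTRIES 16

struct branch_stack {
	uint64_t nr;            /* number of valid entries below */
	struct branch_entry entries[MAX_BRANCH_ENTRIES];
};
```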
>>
>> Branch sampling is always coupled with an event. It can
>> be any PMU event but it can't be a SW or tracepoint event.
>>
>> Branch sampling is requested by setting a new sample_type
>> flag called: PERF_SAMPLE_BRANCH_STACK.
>>
>> To support branch filtering, we introduce a new field
>> to the perf_event_attr struct: branch_sample_type. We chose
>> NOT to overload the config1, config2 field because those
>> are related to the event encoding. Branch stack is a
>> separate feature which is combined with the event.
>>
>> The branch_sample_type is a bitmask of possible filters.
>> The following filters are defined (more can be added):
>> - PERF_SAMPLE_BRANCH_ANY     : any control flow change
>> - PERF_SAMPLE_BRANCH_USER    : branches when target is at user level
>> - PERF_SAMPLE_BRANCH_KERNEL  : branches when target is at kernel level
>> - PERF_SAMPLE_BRANCH_HV      : branches when target is at hypervisor level
>> - PERF_SAMPLE_BRANCH_ANY_CALL: call branches (incl. syscalls)
>> - PERF_SAMPLE_BRANCH_ANY_RET : return branches (incl. syscall returns)
>> - PERF_SAMPLE_BRANCH_IND_CALL: indirect calls
>>
>> It is possible to combine filters, e.g., IND_CALL|USER|KERNEL.
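[Editor's note: a minimal sketch of how a tool combines these filters into the new branch_sample_type bitmask. The bit positions here mirror the enum order listed above, but the macro names and exact values are illustrative, not copied from a kernel header; a real tool would also set PERF_SAMPLE_BRANCH_STACK in attr.sample_type:]

```c
#include <stdint.h>

/* Illustrative filter bits, in the order the cover letter lists them. */
#define BRANCH_USER     (1ULL << 0) /* target at user level */
#define BRANCH_KERNEL   (1ULL << 1) /* target at kernel level */
#define BRANCH_HV       (1ULL << 2) /* target at hypervisor level */
#define BRANCH_ANY      (1ULL << 3) /* any control flow change */
#define BRANCH_ANY_CALL (1ULL << 4) /* calls, incl. syscalls */
#define BRANCH_ANY_RET  (1ULL << 5) /* returns, incl. syscall returns */
#define BRANCH_IND_CALL (1ULL << 6) /* indirect calls */

/* Build a branch_sample_type value by OR-ing a branch-type filter with
 * privilege-level bits, e.g. IND_CALL|USER|KERNEL. When priv is 0, the
 * kernel falls back to the priv level of the associated event. */
static uint64_t make_branch_filter(uint64_t type, uint64_t priv)
{
	return type | priv;
}
```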
>>
>> When the privilege level is not specified, the branch stack
>> inherits that of the associated event.
>>
>> Some processors may not offer hardware branch filtering, e.g., Intel
>> Atom. Some may have HW filtering bugs (e.g., Nehalem). The Intel
>> X86 implementation in this patchset also provides a SW branch filter
>> which works on a best effort basis. It can compensate for the lack
>> of LBR filtering. But first and foremost, it helps work around LBR
>> filtering errata. The goal is to only capture the type of branches
>> requested by the user.
>>
>> It is possible to combine branch stack sampling with PEBS on Intel
>> X86 processors. Depending on the precise_sampling mode, there are
>> certain filtering restrictions. When precise_sampling=1, then
>> there are no filtering restrictions. When precise_sampling > 1,
>> then only ANY|USER|KERNEL filter can be used. This comes from
>> the fact that the kernel uses LBR to compensate for the PEBS
>> off-by-1 skid on the instruction pointer.
>>
>> To demonstrate how the perf_event branch stack sampling interface
>> works, the patchset also modifies perf record to capture taken
>> branches. Similarly perf report is enhanced to display a histogram
>> of taken branches.
>>
>> I would like to thank Roberto Vitillo @ LBL for his work on the perf
>> tool for this.
>>
>> Enough talking, let's take a simple example. Our trivial test program
>> goes like this:
>>
>> void f2(void)
>> {}
>> void f3(void)
>> {}
>> void f1(unsigned long n)
>> {
>>   if (n & 1UL)
>>     f2();
>>   else
>>     f3();
>> }
>> int main(void)
>> {
>>   unsigned long i;
>>
>>   for (i = 0; i < N; i++)
>>     f1(i);
>>   return 0;
>> }
>>
>> $ perf record -b any branchy
>> $ perf report -b
>> # Events: 23K cycles
>> #
>> # Overhead  Source Symbol     Target Symbol
>> # ........  ................  ................
>>
>>     18.13%  [.] f1            [.] main
>>     18.10%  [.] main          [.] main
>>     18.01%  [.] main          [.] f1
>>     15.69%  [.] f1            [.] f1
>>      9.11%  [.] f3            [.] f1
>>      6.78%  [.] f1            [.] f3
>>      6.74%  [.] f1            [.] f2
>>      6.71%  [.] f2            [.] f1
>>
>> Of the total number of branches captured, 18.13% were from f1() -> main().
>>
>> Let's make this clearer by filtering the user call branches only:
>>
>> $ perf record -b any_call -e cycles:u branchy
>> $ perf report -b
>> # Events: 19K cycles
>> #
>> # Overhead  Source Symbol              Target Symbol
>> # ........  .........................  .........................
>> #
>>     52.50%  [.] main                   [.] f1
>>     23.99%  [.] f1                     [.] f3
>>     23.48%  [.] f1                     [.] f2
>>      0.03%  [.] _IO_default_xsputn     [.] _IO_new_file_overflow
>>      0.01%  [k] _start                 [k] __libc_start_main
>>
>> Now it is more obvious. 52% of all the captured branches were calls from main() -> f1().
>> The rest is split 50/50 between f1() -> f2() and f1() -> f3(), which is expected given
>> that f1() dispatches based on odd vs. even values of n, which increases monotonically.
>>
>>
>> Here is a kernel example, where we want to sample indirect calls:
>> $ perf record -a -C 1 -b ind_call -e r1c4:k sleep 10
>> $ perf report -b
>> #
>> # Overhead  Source Symbol               Target Symbol
>> # ........  ..........................  ..........................
>> #
>>     36.36%  [k] __delay                 [k] delay_tsc
>>      9.09%  [k] ktime_get               [k] read_tsc
>>      9.09%  [k] getnstimeofday          [k] read_tsc
>>      9.09%  [k] notifier_call_chain     [k] tick_notify
>>      4.55%  [k] cpuidle_idle_call       [k] intel_idle
>>      4.55%  [k] cpuidle_idle_call       [k] menu_reflect
>>      2.27%  [k] handle_irq              [k] handle_edge_irq
>>      2.27%  [k] ack_apic_edge           [k] native_apic_mem_write
>>      2.27%  [k] hpet_interrupt_handler  [k] hrtimer_interrupt
>>      2.27%  [k] __run_hrtimer           [k] watchdog_timer_fn
>>      2.27%  [k] enqueue_task            [k] enqueue_task_rt
>>      2.27%  [k] try_to_wake_up          [k] select_task_rq_rt
>>      2.27%  [k] do_timer                [k] read_tsc
>>
>> Due to HW limitations, branch filtering may be approximate on
>> Core, Atom processors. It is more accurate on Nehalem, Westmere
>> and best on Sandy Bridge.
>>
>> In version 2, we've updated the patch to tip/master (commit 5734857) and
>> we've incorporated the feedback from v1 concerning the anonymous bitfield
>> struct for branch_stack_entry and the handling of i386 ABI binaries
>> on 64-bit hosts in the instruction decoder for the LBR SW filter.
>>
>> In version 3, we've updated to 3.2.0-tip. The Atom revision
>> check has been put into its own patch. We fixed a browser
>> issue with perf report. We fixed all the style issues as well.
>>
>> In version 4, we've modified the branch stack API to add a missing
>> priv level: hypervisor. There is a new PERF_SAMPLE_BRANCH_HV. It
>> is not used on Intel X86. Thanks to khandual@xxxxxxxxxxxxxxxxxx
>> for pointing this out. We also fixed a compilation error on ARM.
>>
>> In version 4, we also extend the patch to include the changes necessary
>> to the perf tool to support reading perf.data files which were produced
>> from older perf_event ABI revisions. This patch set extends the ABI
>> with a new field in struct perf_event_attr. That struct is saved as
>> is in the perf.data file. Therefore, older perf.data files contain a
>> smaller perf_event_attr struct, yet perf must process them transparently.
>> That's not the case today. It dies with 'incompatible file format'.
>>
>> The patch solves this problem and, at the same time, decouples endianness
>> detection from the size of perf_event_attr. Endianness is now detected via
>> the signature (the first 8 bytes of the file). We introduce a new signature
>> (PERFILE2). It is not laid out the same way in the file based on the endianness
>> of the host where the file is written. Therefore, we can dynamically detect
>> the endianness by simply reading the first 8 bytes. The size of the
>> perf_event_attr struct can then be processed according to the endianness.
>> The ambiguity of the size field serving as both endianness marker and
>> actual size is gone. We can now distinguish an older ABI by its size
>> without confusing it with an endianness mismatch.
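[Editor's note: the detection logic described above can be sketched as follows. The helper names are hypothetical (the real implementation lives in tools/perf/util/header.c); only the "PERFILE2" signature and the compare-against-byte-swapped-magic idea come from the text:]

```c
#include <stdint.h>
#include <string.h>

/* The new 8-byte signature written at the start of a perf.data file. */
static const char PERF_MAGIC[8] = "PERFILE2";

/* Inspect the first 8 bytes of the file and decide:
 *   0 -> same endianness as this host
 *   1 -> opposite endianness (sizes must be byte-swapped before use)
 *  -1 -> not a signature we recognize */
static int detect_endianness(const void *first8)
{
	uint64_t magic, ref;

	memcpy(&magic, first8, sizeof(magic));
	memcpy(&ref, PERF_MAGIC, sizeof(ref));

	if (magic == ref)
		return 0;                     /* native byte order */
	if (magic == __builtin_bswap64(ref))
		return 1;                     /* written on opposite-endian host */
	return -1;                            /* unknown format */
}
```

With this, the perf_event_attr size read from the header can be swapped (or not) according to the detected byte order before being interpreted.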
>>
>> In version 5, we fix the PEBS+LBR vs. BRANCH_STACK check in x86_pmu_hw_config.
>> We also changed the handling of PERF_SAMPLE_BRANCH_HV on X86. It is now ignored
>> instead of triggering an error. That enables: perf record -b any -e cycles,
>> without having to force a priv level on the branch type. We also fix an
>> uninitialized variable bug in the perf tool reported by reviewers. Thanks
>> to Anshuman Khandual for his comments.
>>
>> In version 6, we have fixed several issues in the perf tool code and
>> especially in patch 11. We have fully implemented the --sort option
>> on the branch source and target. We have fixed several column alignment
>> issues. We have integrated feedback from David Ahern, concerning patch
>> 16 and the ability to read perf.data files written from a different
>> ABI. Perf will now reject any perf.data file that has a larger perf_event_attr
>> size. Perf provides backward, not forward, compatibility.
>
>
> Hey Stephane,
>
> Could you please specify which tip tree I can apply and try out the
> V6 patchset ? Thank you.
>
V6 was posted prior to the jump_label changes. I suspect
any tip before:

77a73e5 static keys: Introduce 'struct static_key', very_[un]likely(),
static_key_slow_[inc|dec]()

should work, though I have not tried.

>
>>
>> Signed-off-by: Stephane Eranian <eranian@xxxxxxxxxx>
>>
>>
>> Roberto Agostino Vitillo (3):
>>  perf: add code to support PERF_SAMPLE_BRANCH_STACK
>>  perf: add support for sampling taken branch to perf record
>>  perf: add support for taken branch sampling to perf report
>>
>> Stephane Eranian (15):
>>  perf: add generic taken branch sampling support
>>  perf: add Intel LBR MSR definitions
>>  perf: add Intel X86 LBR sharing logic
>>  perf: sync branch stack sampling with X86 precise_sampling
>>  perf: add Intel X86 LBR mappings for PERF_SAMPLE_BRANCH filters
>>  perf: disable LBR support for older Intel Atom processors
>>  perf: implement PERF_SAMPLE_BRANCH for Intel X86
>>  perf: add LBR software filter support for Intel X86
>>  perf: disable PERF_SAMPLE_BRANCH_* when not supported
>>  perf: add hook to flush branch_stack on context switch
>>  perf: fix endianness detection in perf.data
>>  perf: add ABI reference sizes
>>  perf: enable reading of perf.data files from different ABI rev
>>  perf: fix bug print_event_desc()
>>  perf: make perf able to read file from older ABIs
>>
>>  arch/alpha/kernel/perf_event.c             |    4 +
>>  arch/arm/kernel/perf_event.c               |    4 +
>>  arch/mips/kernel/perf_event_mipsxx.c       |    4 +
>>  arch/powerpc/kernel/perf_event.c           |    4 +
>>  arch/sh/kernel/perf_event.c                |    4 +
>>  arch/sparc/kernel/perf_event.c             |    4 +
>>  arch/x86/include/asm/msr-index.h           |    7 +
>>  arch/x86/kernel/cpu/perf_event.c           |   85 ++++-
>>  arch/x86/kernel/cpu/perf_event.h           |   19 +
>>  arch/x86/kernel/cpu/perf_event_amd.c       |    3 +
>>  arch/x86/kernel/cpu/perf_event_intel.c     |  120 +++++--
>>  arch/x86/kernel/cpu/perf_event_intel_ds.c  |   22 +-
>>  arch/x86/kernel/cpu/perf_event_intel_lbr.c |  526 ++++++++++++++++++++++++++--
>>  include/linux/perf_event.h                 |   82 ++++-
>>  kernel/events/core.c                       |  177 ++++++++++
>>  kernel/events/hw_breakpoint.c              |    6 +
>>  tools/perf/Documentation/perf-record.txt   |   25 ++
>>  tools/perf/Documentation/perf-report.txt   |    7 +
>>  tools/perf/builtin-record.c                |   74 ++++
>>  tools/perf/builtin-report.c                |   98 +++++-
>>  tools/perf/perf.h                          |   18 +
>>  tools/perf/util/annotate.c                 |    2 +-
>>  tools/perf/util/event.h                    |    1 +
>>  tools/perf/util/evsel.c                    |   14 +
>>  tools/perf/util/header.c                   |  230 +++++++++++--
>>  tools/perf/util/hist.c                     |   93 ++++-
>>  tools/perf/util/hist.h                     |    7 +
>>  tools/perf/util/session.c                  |   72 ++++
>>  tools/perf/util/session.h                  |    4 +
>>  tools/perf/util/sort.c                     |  362 ++++++++++++++-----
>>  tools/perf/util/sort.h                     |    5 +
>>  tools/perf/util/symbol.h                   |   13 +
>>  32 files changed, 1866 insertions(+), 230 deletions(-)
>>
>
>
> --
> Anshuman Khandual
> Linux Technology Centre
> IBM Systems and Technology Group
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/