Re: [PATCH V3 0/3] perf tool: Haswell LBR call stack support (user)

From: Jiri Olsa
Date: Mon Nov 17 2014 - 11:01:45 EST


On Fri, Nov 14, 2014 at 08:44:09AM -0500, kan.liang@xxxxxxxxx wrote:
> From: Kan Liang <kan.liang@xxxxxxxxx>
>
> This is the user space patch for Haswell LBR call stack support.
> For many profiling tasks we need the callgraph. For example we often
> need to see the caller of a lock or the caller of a memcpy or other
> library function to actually tune the program. Frame pointer unwinding
> is efficient and works well. But frame pointers are off by default on
> 64bit code (and on modern 32bit gccs), so there are many binaries around
> that do not use frame pointers. Profiling unchanged production code is
> very useful in practice. On some CPUs frame pointer also has a high
> cost. Dwarf2 unwinding also does not always work and is extremely slow
> (up to 20% overhead).
>
> Haswell has a new feature that utilizes the existing Last Branch Record
> facility to record call chains. When the feature is enabled, function
> calls are collected as normal, but as return instructions are
> executed the last captured branch record is popped from the on-chip LBR
> registers. The LBR call stack facility provides an alternative way to
> get the callgraph. It has some limitations too, but should work in most cases
> and is significantly faster than dwarf. Frame pointer unwinding is still
> the best default, but LBR call stack is a good alternative when nothing
> else works.
>


---
> A new call chain recording option "lbr" is introduced into perf tool for
> LBR call stack. The user can use --call-graph lbr to get the call stack
> information from hardware.
>
> When profiling bc(1) on Fedora 19:
> echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph lbr bc -l < cmd
> If enabling LBR, perf report output looks like:
> 50.36% bc bc [.] bc_divide
> |
> --- bc_divide
> execute
> run_code
> yyparse
> main
> __libc_start_main
> _start
> 33.66% bc bc [.] _one_mult
> |
> --- _one_mult
> bc_divide
> execute
> run_code
> yyparse
> main
> __libc_start_main
> _start
> 7.62% bc bc [.] _bc_do_add
> |
> --- _bc_do_add
> |
> |--99.89%-- 0x2000186a8
> --0.11%-- [...]
> 6.83% bc bc [.] _bc_do_sub
> |
> --- _bc_do_sub
> |
> |--99.94%-- bc_add
> | execute
> | run_code
> | yyparse
> | main
> | __libc_start_main
> | _start
> --0.06%-- [...]
> 0.46% bc libc-2.17.so [.] __memset_sse2
> |
> --- __memset_sse2
> |
> |--54.13%-- bc_new_num
> | |
> | |--51.00%-- bc_divide
> | | execute
> | | run_code
> | | yyparse
> | | main
> | | __libc_start_main
> | | _start
> | |
> | |--30.46%-- _bc_do_sub
> | | bc_add
> | | execute
> | | run_code
> | | yyparse
> | | main
> | | __libc_start_main
> | | _start
> | |
> | --18.55%-- _bc_do_add
> | bc_add
> | execute
> | run_code
> | yyparse
> | main
> | __libc_start_main
> | _start
> |
> --45.87%-- bc_divide
> execute
> run_code
> yyparse
> main
> __libc_start_main
> _start
> If using FP, perf report output looks like:
> echo 'scale=2000; 4*a(1)' > cmd; perf record --call-graph fp bc -l < cmd
> 50.49% bc bc [.] bc_divide
> |
> --- bc_divide
> 33.57% bc bc [.] _one_mult
> |
> --- _one_mult
> 7.61% bc bc [.] _bc_do_add
> |
> --- _bc_do_add
> 0x2000186a8
> 6.88% bc bc [.] _bc_do_sub
> |
> --- _bc_do_sub
> 0.42% bc libc-2.17.so [.] __memcpy_ssse3_back
> |
> --- __memcpy_ssse3_back
>
> If using LBR, perf report -D output looks like:
> 11739295893248 0x4d0 [0xe0]: PERF_RECORD_SAMPLE(IP, 0x2): 10505/10505:
> 0x40054d period: 39255 addr: 0
> ... LBR call chain: nr:7
> ..... 0: fffffffffffffe00
> ..... 1: 0000000000400540
> ..... 2: 0000000000400587
> ..... 3: 00000000004005b3
> ..... 4: 00000000004005ef
> ..... 5: 0000003d1cc21b43
> ..... 6: 0000000000400474
> ... FP chain: nr:6
> ..... 0: fffffffffffffe00
> ..... 1: 000000000040054d
> ..... 2: 000000000040058c
> ..... 3: 00000000004005b8
> ..... 4: 00000000004005f4
> ..... 5: 0000003d1cc21b45
> ... thread: a.out:10505
> ...... dso: /home/lk/a.out
>
>
> The LBR call stack has the following known limitations:
> - Zero-length calls are not filtered out by hardware
> - Exception handling such as setjmp/longjmp causes calls and returns
>   to mismatch
> - Pushing a different return address onto the stack causes calls and
>   returns to mismatch
> - If the call stack is deeper than the LBR, only the last entries are
>   captured
---

also could you please add all above ^^^ as an additional text
for patch 3/3 changelog (perf tools: Construct LBR call chain)?

looks too nice to lose it ;-)
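
The push-on-call / pop-on-return behaviour described in the quoted text,
and the "deeper than the LBR" limitation, can be sketched with a toy model.
This is only an illustration, not perf or kernel code: the class and method
names are hypothetical, and the depth of 16 is an assumption matching the
number of LBR entries on Haswell.

```python
from collections import deque

LBR_DEPTH = 16  # assumption: 16 LBR entries, as on Haswell


class LbrCallStack:
    """Toy model of LBR call-stack mode (hypothetical names, not a real API)."""

    def __init__(self, depth=LBR_DEPTH):
        # A deque with maxlen drops the oldest record once it is full,
        # which mimics losing the outermost callers on deep stacks.
        self.entries = deque(maxlen=depth)

    def call(self, from_addr, to_addr):
        # A call instruction records a (from, to) branch pair.
        self.entries.append((from_addr, to_addr))

    def ret(self):
        # A return pops the most recently captured record, if any.
        if self.entries:
            self.entries.pop()

    def chain(self):
        # Call-site addresses, innermost caller first.
        return [from_addr for from_addr, _to in reversed(self.entries)]


# Six nested calls into a 4-entry model: the two outermost callers are lost.
lbr = LbrCallStack(depth=4)
for level in range(6):
    lbr.call(from_addr=0x1000 + level, to_addr=0x2000 + level)
lbr.ret()  # the innermost function returns, so its record is popped
print([hex(addr) for addr in lbr.chain()])  # ['0x1004', '0x1003', '0x1002']
```

A setjmp/longjmp-style unwind would skip the matching ret() calls in this
model, leaving stale records behind, which is the calls/returns mismatch
listed in the limitations.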

thanks,
jirka