Re: perf performance with libdw vs libunwind

From: Ian Rogers

Date: Thu Apr 23 2026 - 18:28:32 EST

On Thu, Apr 23, 2026 at 2:49 AM Guilherme Amadio <amadio@xxxxxxxxxx> wrote:
>
> Hi Ian,
>
> On Wed, Apr 22, 2026 at 09:21:52PM -0700, Ian Rogers wrote:
> > Hi Guilherme,
> >
> > Thanks for the feedback but I'm a little confused. Your .perfconfig is
> > set to use frame-pointer-based unwinding, so neither libunwind nor
> > libdw should be used for unwinding. With framepointer unwinding, a
> > sample contains an array of IPs gathered by walking the linked list of
> > frame pointers on the stack. With --call-graph=dwarf a region of
> > memory is copied into a sample (the stack) along with some initial
> > register values, libdw or libunwind is then used to process this
> > memory using the dwarf information in the ELF binary.
>
> Thanks for your reply and pardon my ignorance, I thought that the libraries
> were generically used for stack unwinding, regardless of whether it's fp or dwarf.
> I should have looked a bit deeper before reporting this, but we are on the
> right track.
>
> > So something changed in v7.0. With dwarf unwinding (libdw or
> > libunwind) we always had all inline functions on the stack, but not
> > with frame pointers. The IP in the frame pointer array can be
> > of an instruction within an inlined function. In v7.0 we added a patch
> > that includes inline information for both frame pointer and LBR-based
> > stack traces:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
>
> This is a nice development, I have been using --call-graph=dwarf to see
> the inlined symbols, so having the ability to see inlined functions with
> fp unwinding, which is much more lightweight in terms of space (i.e. the
> size of the final perf.data files), is great.
>
> > By default we try to add inline information using libdw; if that fails
> > we try llvm, then libbfd and finally the command line addr2line tool:
> > https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
> > I suspect the slowdown comes from doing all this addr2line work on a
> > binary that's been stripped. The good news here is that you can add
> > a config option to avoid all the fallbacks, "addr2line.style=libdw".
> > You can also disable inline information by adding "--no-inline" to
> > your `perf report` command line.
>
> The binary and its direct dependent libraries, as well as most other dependencies
> are not stripped, but it's possible that some dependency in the full chain might be.
>
> When I run perf report using --no-inline I indeed recover the performance I
> had before with perf-6.19.12. However, setting addr2line.style=libdw did not
> help much. Here is what I observe:
>
> $ perf config
> call-graph.record-mode=fp
> $ perf version
> perf version 7.0
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 63800 Hz
> [ perf record: Woken up 18 times to write data ]
> [ perf record: Captured and wrote 4.720 MB perf.data (16552 samples) ]
> 1.26
> $ time perf report -q --stdio -g none --children --no-inline --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.64% 0.00% root.exe root.exe [.] _start
> 91.63% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.46% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.31% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 88.20% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
> 1.70
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-noinline.svg

Thanks for all the reporting!
So here the C++ demangler is about a third of the execution time and
there's no dwarf decoding for the inline functions.

> Now without --no-inline, and this first command is without addr2line.style=libdw in the config:
>
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe root.exe [.] _start
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
> 76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
>
> 241.60
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-addr2line.svg

Here perf first tries libdw for the addr2line lookup and then falls
back to the addr2line command. Time is mainly spent gathering
addr2line inline information.
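
To make the fallback cost concrete, here is a rough Python sketch of the
"try each backend in turn" idea. It is not perf's code (the real logic
is in tools/perf/util/srcline.c): resolve_srcline, make_stub and the
stub resolvers are hypothetical names, and treating the style setting
as a single backend name is an assumption based only on the
"addr2line.style=libdw" example in this thread.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical resolver type: (binary path, address) -> "file:line" or None.
# These stand in for the real backends; they are NOT perf APIs.
Resolver = Callable[[str, int], Optional[str]]

# Default order as described in this thread.
DEFAULT_ORDER = ["libdw", "llvm", "libbfd", "addr2line"]


def resolve_srcline(
    binary: str,
    addr: int,
    resolvers: Dict[str, Resolver],
    style: Optional[str] = None,
) -> Tuple[Optional[str], Optional[str]]:
    """Try backends in order; return (backend, result) for the first success.

    `style` models restricting the chain to one backend (assumption).
    """
    order = [style] if style else DEFAULT_ORDER
    for name in order:
        result = resolvers[name](binary, addr)
        if result is not None:
            return name, result
    return None, None


calls: List[str] = []  # records which backends were actually invoked


def make_stub(name: str, answer: Optional[str]) -> Resolver:
    """Build a fake backend that logs its invocation and returns `answer`."""
    def stub(binary: str, addr: int) -> Optional[str]:
        calls.append(name)
        return answer
    return stub


# Simulate an address with no usable debug info: every backend fails,
# so the default chain pays for all four attempts.
resolvers = {name: make_stub(name, None) for name in DEFAULT_ORDER}
resolve_srcline("stripped.so", 0x1234, resolvers)
print(calls)  # ['libdw', 'llvm', 'libbfd', 'addr2line']

calls.clear()
resolve_srcline("stripped.so", 0x1234, resolvers, style="libdw")
print(calls)  # ['libdw']
```

When every lookup fails, the default chain pays for each backend per
address, which would fit the gap between the two timings in this
thread, though that reading is only a guess from the flame graphs.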

> $ perf config addr2line.style=libdw
> $ perf config
> call-graph.record-mode=fp
> addr2line.style=libdw
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.65% 0.00% root.exe root.exe [.] main
> 92.64% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.64% 0.00% root.exe root.exe [.] _start
> 88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
>
> 137.93
> Flamegraph: https://amadio.web.cern.ch/perf/perf-report-libdw.svg

Here time is just spent in libdw.

> The flame graphs above are for the perf-report commands themselves.
>
> So, the performance is fine with --no-inline, and it's better with addr2line.style=libdw.
> However, the function names are not the best in the last two reports, so this problem remains.
>
> > Your report suggests we should tweak the defaults for showing inline
> > information. Could you try the options I've suggested and see if they
> > remedy the issue for you?
>
> Thank you for the suggestions. Indeed --no-inline seems to bring back the
> previous performance. Please let me know if you would like me to try more
> things and what other information you need for the cases without --no-inline.

Performance-wise, things are working as expected. I'm confused about
why we see different symbol names; perhaps this points to a libdw bug.
With or without --no-inline, libelf gets the symbol name; libdw is only
used to get the source line and inlining information. Perhaps this is
more of a bug with `-g none`, which is an option I've never used. I'm
quite busy at the moment, so it's not easy for me to dig into this.
Perhaps we can create a test and try to get an LLM to investigate it.
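
As a possible starting point for such a test, here is a small Python
sketch that diffs the symbols of two `perf report --stdio` outputs. It
assumes rows shaped like the ones quoted in this thread (children%,
self%, command, shared object, [type], symbol); parse_report and
diff_symbols are hypothetical helper names, and command or DSO names
containing spaces would not parse.

```python
import re

# Matches rows like:
#   88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
ROW = re.compile(
    r"^\s*([\d.]+)%\s+([\d.]+)%\s+(\S+)\s+(\S+)\s+\[(.)\]\s+(.+?)\s*$"
)


def parse_report(text):
    """Return {(dso, symbol): children_percent} for every row that matches."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            children, _self, _comm, dso, _type, symbol = m.groups()
            rows[(dso, symbol)] = float(children)
    return rows


def diff_symbols(baseline, other):
    """Return (keys only in baseline, keys only in other), both sorted."""
    a, b = parse_report(baseline), parse_report(other)
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


# Rows copied from the reports in this thread.
no_inline = """\
92.65% 0.00% root.exe root.exe [.] main
88.22% 0.00% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
"""
with_inline = """\
92.65% 0.00% root.exe root.exe [.] main
88.22% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
76.34% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
"""

missing, extra = diff_symbols(no_inline, with_inline)
print(missing)  # [('libCore.so.6.38.04', 'ROOT::Internal::GetROOT2()')]
print(extra)    # [('libCore.so.6.38.04', 'GetROOT2 (inlined)')]
```

Running this on the full outputs with and without --no-inline should
list exactly the entries whose names differ or that disappear from the
report.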

Thanks,
Ian

> Best regards,
> -Guilherme