Re: perf performance with libdw vs libunwind
From: Ian Rogers
Date: Thu Apr 23 2026 - 00:22:11 EST
On Wed, Apr 22, 2026 at 6:26 AM Guilherme Amadio <amadio@xxxxxxxxxx> wrote:
>
> Dear Ian,
>
> Now that linux-7.0 is out, I've updated perf in Gentoo and moved it
> to use libdw, as libunwind has been deprecated. However, when I tried
> to use perf, I noticed a substantial performance regression and some
> other problems, which I report below.
>
> I use here an example which is my own "standard candle"¹ for checking
> that stack unwinding is working properly: the startup of ROOT², which
> is a C++ interpreter heavily used in high energy physics data analysis.
> I simply run 'root -l -q' which is the equivalent of 'python -c ""' for
> ROOT. It takes less than a second to run, but since it runs a full
> initialization of Clang/LLVM as part of the interpreter, it produces a
> rich flamegraph whose expected shape I know ahead of time, so
> I use it to check that stack unwinding and symbol resolution are working.
>
> 1. https://en.wikipedia.org/wiki/Cosmic_distance_ladder#Standard_candles
> 2. https://root.cern
>
> Below I show a comparison of the timings of perf record/report for this.
>
> First, I run it with perf-6.19.12 which is configured to use libunwind:
>
> $ perf config
> call-graph.record-mode=fp
> $ perf -vv
> perf version 6.19.12
> aio: [ on ] # HAVE_AIO_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> bpf_skeletons: [ on ] # HAVE_BPF_SKEL
> debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
> dwarf: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
> libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
> libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
> libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpfm4: [ on ] # HAVE_LIBPFM
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
> libunwind: [ on ] # HAVE_LIBUNWIND_SUPPORT
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 79800 Hz
>
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.688 MB perf.data (19693 samples) ]
> 1.25
> $ time perf report -q --stdio -g none --children --percent-limit 75
> 92.63% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.63% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.63% 0.00% root.exe root.exe [.] _start
> 92.63% 0.00% root.exe root.exe [.] main
> 91.53% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.36% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.18% 0.00% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 88.10% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 88.08% 0.00% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 75.62% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 75.62% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
> 1.86
> $ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-6.19-libunwind.stacks
> 4.08
> $ flamegraph.pl -w 2560 --title 'Flame Graph: ROOT Startup' --subtitle 'Created with perf-6.19.12 using libunwind' < root-perf-6.19-libunwind.stacks >| root-perf-6.19-libunwind.svg
>
> So, as you can see above, a simple perf-report took 1.86 seconds, and
> perf-script took 4.08 seconds with libunwind. Now with perf upgraded to
> perf-7.0 with libdw, this is what I see:
>
> $ perf -vv
> perf version 7.0
> aio: [ on ] # HAVE_AIO_SUPPORT
> bpf: [ on ] # HAVE_LIBBPF_SUPPORT
> bpf_skeletons: [ on ] # HAVE_BPF_SKEL
> debuginfod: [ on ] # HAVE_DEBUGINFOD_SUPPORT
> dwarf: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf_getlocations: [ on ] # HAVE_LIBDW_SUPPORT
> dwarf-unwind: [ on ] # HAVE_DWARF_UNWIND_SUPPORT
> libbfd: [ OFF ] # HAVE_LIBBFD_SUPPORT ( tip: Deprecated, license incompatibility, use BUILD_NONDISTRO=1 and install binutils-dev[el] )
> libbabeltrace: [ on ] # HAVE_LIBBABELTRACE_SUPPORT
> libbpf-strings: [ on ] # HAVE_LIBBPF_STRINGS_SUPPORT
> libcapstone: [ on ] # HAVE_LIBCAPSTONE_SUPPORT
> libdw-dwarf-unwind: [ on ] # HAVE_LIBDW_SUPPORT
> libelf: [ on ] # HAVE_LIBELF_SUPPORT
> libLLVM: [ on ] # HAVE_LIBLLVM_SUPPORT
> libnuma: [ on ] # HAVE_LIBNUMA_SUPPORT
> libopencsd: [ OFF ] # HAVE_CSTRACE_SUPPORT
> libperl: [ on ] # HAVE_LIBPERL_SUPPORT
> libpfm4: [ on ] # HAVE_LIBPFM
> libpython: [ on ] # HAVE_LIBPYTHON_SUPPORT
> libslang: [ on ] # HAVE_SLANG_SUPPORT
> libtraceevent: [ on ] # HAVE_LIBTRACEEVENT
> libunwind: [ OFF ] # HAVE_LIBUNWIND_SUPPORT ( tip: Deprecated, use LIBUNWIND=1 and install libunwind-dev[el] to build with it )
> lzma: [ on ] # HAVE_LZMA_SUPPORT
> numa_num_possible_cpus: [ on ] # HAVE_LIBNUMA_SUPPORT
> zlib: [ on ] # HAVE_ZLIB_SUPPORT
> zstd: [ on ] # HAVE_ZSTD_SUPPORT
> rust: [ on ] # HAVE_RUST_SUPPORT
> $ time perf record -g -F max -e cycles:uk -- root.exe -l -q
> info: Using a maximum frequency rate of 79800 Hz
>
> [ perf record: Woken up 23 times to write data ]
> [ perf record: Captured and wrote 5.766 MB perf.data (19922 samples) ]
> 1.28
> $ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
> 92.44% 0.00% root.exe root.exe [.] main
> 92.44% 0.00% root.exe root.exe [.] _start
> 92.44% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 87.95% 0.00% root.exe libCore.so.6.38.04 [.] GetROOT2 (inlined)
> 75.78% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
>
> Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
> User time (seconds): 250.33
> System time (seconds): 21.18
> Percent of CPU this job got: 99%
> ** Elapsed (wall clock) time (h:mm:ss or m:ss): 4:32.97
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> ** Maximum resident set size (kbytes): 4433000
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 7
> Minor (reclaiming a frame) page faults: 9850739
> Voluntary context switches: 226
> Involuntary context switches: 11388
> Swaps: 0
> File system inputs: 80776
> File system outputs: 232
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> After seeing how much memory perf was using, I decided to record that
> too, so as you can see above, perf 7.0 with libdw took 4 minutes and 33
> seconds for the same simple perf-report that took 1.86 seconds before,
> and the names of the symbols are not as complete as with libunwind.

Hi Guilherme,

Thanks for the feedback, but I'm a little confused. Your .perfconfig is
set to use frame-pointer-based unwinding, so neither libunwind nor
libdw should be used for unwinding. With frame-pointer unwinding, a
sample contains an array of IPs gathered by walking the linked list of
frame pointers on the stack. With --call-graph=dwarf, a region of
memory (the stack) is copied into the sample along with some initial
register values; libdw or libunwind is then used to process this
memory using the DWARF information in the ELF binary.
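
To make the difference concrete, here is a toy Python model of the
frame-pointer walk described above. It is only a sketch: the stack
layout (saved frame pointer at fp, return address at fp + 8), the
addresses, and the walk_frame_pointers() helper are assumptions made
for illustration, not code from perf or the kernel.

```python
# Toy model of frame-pointer unwinding; not perf's or the kernel's real code.
# Assumed layout: the caller's saved frame pointer is stored at address fp,
# and the return address into the caller is stored at fp + WORD.

WORD = 8  # bytes per stack slot on a 64-bit target (assumption)


def walk_frame_pointers(memory, ip, fp, max_depth=127):
    """Return the list of IPs found by following the frame-pointer chain."""
    ips = [ip]  # the sampled instruction pointer comes first
    while fp != 0 and len(ips) < max_depth:
        saved_fp = memory.get(fp)          # caller's frame pointer
        ret_addr = memory.get(fp + WORD)   # return address into the caller
        if saved_fp is None or ret_addr is None:
            break  # the chain leaves the readable stack
        ips.append(ret_addr)
        if saved_fp != 0 and saved_fp <= fp:
            break  # callers live at higher addresses; anything else is corrupt
        fp = saved_fp
    return ips


# Simulated stack for the call chain _start -> main -> leaf.
stack = {
    0x7000: 0x7020, 0x7008: 0x401200,  # leaf's frame: return address in main
    0x7020: 0x0,    0x7028: 0x401050,  # main's frame: return address in _start
}
callchain = walk_frame_pointers(stack, ip=0x401300, fp=0x7000)
print([hex(addr) for addr in callchain])
```

With --call-graph=dwarf there is no such walk at record time; the
copied stack bytes and registers are interpreted later, which is where
libdw or libunwind come in.
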
> Stack unwinding itself also seems inconsistent with the previous run.
> Here's the equivalent with perf-6.19.12:
>
> $ /usr/bin/time -v perf report -q --stdio -g none --children --percent-limit 75
> 92.46% 0.00% root.exe libc.so.6 [.] __libc_start_call_main
> 92.46% 0.00% root.exe libc.so.6 [.] __libc_start_main@@GLIBC_2.34
> 92.46% 0.00% root.exe root.exe [.] _start
> 92.46% 0.00% root.exe root.exe [.] main
> 91.38% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::GetROOT()
> 89.24% 0.00% root.exe libRint.so.6.38.04 [.] TRint::TRint(char const*, int*, char**, void*, int, bool, bool)
> 88.05% 0.01% root.exe libCore.so.6.38.04 [.] TApplication::TApplication(char const*, int*, char**, void*, int)
> 87.96% 0.01% root.exe libCore.so.6.38.04 [.] ROOT::Internal::GetROOT2()
> 87.95% 0.01% root.exe libCore.so.6.38.04 [.] TROOT::InitInterpreter()
> 75.50% 0.00% root.exe libCling.so.6.38.04 [.] CreateInterpreter
> 75.50% 0.00% root.exe libCling.so.6.38.04 [.] TCling::TCling(char const*, char const*, char const* const*, void*)
>
>
> Command being timed: "perf report -q --stdio -g none --children --percent-limit 75"
> User time (seconds): 1.79
> System time (seconds): 0.08
> Percent of CPU this job got: 99%
> Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.87
> Average shared text size (kbytes): 0
> Average unshared data size (kbytes): 0
> Average stack size (kbytes): 0
> Average total size (kbytes): 0
> Maximum resident set size (kbytes): 265108
> Average resident set size (kbytes): 0
> Major (requiring I/O) page faults: 0
> Minor (reclaiming a frame) page faults: 37887
> Voluntary context switches: 4
> Involuntary context switches: 77
> Swaps: 0
> File system inputs: 0
> File system outputs: 8
> Socket messages sent: 0
> Socket messages received: 0
> Signals delivered: 0
> Page size (bytes): 4096
> Exit status: 0
>
> Then, this is perf-script:
>
> $ time perf script -c root.exe | ~/bin/stackcollapse.pl --all >| root-perf-7.0-libdw.stacks
> cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
> ... (line repeated many times)
> cmd__addr2line /home/amadio/.debug/.build-id/30/d45df71aac8ef054ca89646b179487af641c45/elf: could not read first record
> 273.49
>
> I see many of these cmd__addr2line errors, and it takes 273.49 seconds
> compared with 4.08 seconds with perf-6.19.12, and the flamegraph has
> abbreviated function names like "operator()" instead of the full name,
> which is also somewhat problematic as there's loss of information
> relative to what libunwind used to provide. The flamegraphs for the two
> runs above are available at https://cern.ch/amadio/perf. I didn't want to
> attach the files here as I don't want to send big files to the lists.
> For the record, I am using libunwind-1.8.3 and elfutils-0.195 in these tests.
>
> If you'd like to perform the same kind of test, you can install ROOT
> from EPEL on a RHEL-like distribution inside a container with a simple
> "dnf install root", or just try the same record/report commands with a
> clang++ compilation of a simple program as a decent replacement.

Something that changed in v7.0 concerns inline functions. With DWARF
unwinding (libdw or libunwind) we always had all inline functions on
the stack, but with frame pointers we did not. The IP in the frame
pointer array can be that of an instruction within an inlined
function. In v7.0 we added a patch that includes inline information
for both frame-pointer and LBR-based stack traces:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/tools/perf/util/machine.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf
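
As a rough illustration of what that expansion means for a callchain,
here is a toy Python sketch. The DEBUG_INFO table, the addresses, and
the function names are all made up; perf's real lookup goes through
the debug information of the binary.

```python
# Toy model of expanding a frame-pointer callchain with inline information.
# The addresses and function names are made up for illustration.

# Hypothetical debug-info lookup: address -> functions covering that address,
# innermost first. More than one entry means the address is in inlined code.
DEBUG_INFO = {
    0x401300: ["helper", "compute"],  # helper() was inlined into compute()
    0x401200: ["main"],
}


def expand_callchain(ips):
    """Turn raw IPs into (function, is_inlined) entries, innermost first."""
    entries = []
    for ip in ips:
        funcs = DEBUG_INFO.get(ip, [hex(ip)])  # fall back to the raw address
        for depth, func in enumerate(funcs):
            is_inlined = depth < len(funcs) - 1  # all but the outermost
            entries.append((func, is_inlined))
    return entries


for func, is_inlined in expand_callchain([0x401300, 0x401200, 0x401050]):
    print(func + (" (inlined)" if is_inlined else ""))
```

A single sampled IP can therefore contribute more than one entry to
the reported stack, which is why entries marked "(inlined)" now show
up with frame pointers too.
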
By default we try to add inline information using libdw; if that fails
we try llvm, then libbfd, and finally the command-line addr2line tool:
https://web.git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/perf/util/srcline.c?h=perf-tools-next&id=28cb835f7645892f4559b92fcfeb25a81646f4cf#n145
I suspect the slowdown comes from doing all this addr2line work on a
binary that's been stripped. The good news is that you can add a
config option to avoid all the fallbacks, "addr2line.style=libdw".
You can also disable inline information by adding "--no-inline" to
your `perf report` command line.
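
The cost of the fallback chain can be sketched with a toy Python
model. The backend order follows the description above, but the lookup
functions, the addresses, and the inline_frames() helper are
hypothetical stand-ins, not perf's implementation.

```python
# Toy model of an ordered fallback chain for inline lookups; not perf's code.
# Backend names follow the order described above; the lookups are stand-ins.

calls = []  # records every backend attempt so the cost is visible


def make_backend(name, table):
    """Build a fake backend that only knows the addresses in `table`."""
    def lookup(addr):
        calls.append(name)
        return table.get(addr)  # None means "no inline information found"
    return name, lookup


BACKENDS = [
    make_backend("libdw", {0x1000: ["inlined_callee", "caller"]}),
    make_backend("llvm", {}),
    make_backend("libbfd", {}),
    make_backend("addr2line", {}),  # stands in for the external tool
]


def inline_frames(addr, style=None):
    """Try each backend in order; `style` restricts the search to one backend."""
    for name, lookup in BACKENDS:
        if style is not None and name != style:
            continue
        frames = lookup(addr)
        if frames is not None:
            return frames
    return []


# An address with debug info is answered by the first backend.
print(inline_frames(0x1000), calls)

# An address in a stripped binary falls through every backend...
calls.clear()
print(inline_frames(0x2000), calls)

# ...unless the search is pinned to a single backend.
calls.clear()
print(inline_frames(0x2000, style="libdw"), calls)
```

In this model an address without debug information costs four lookups
instead of one, and the last one stands in for talking to an external
process. That is consistent with the slowdown suspected above, but it
does not prove it.
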
Your report suggests we should tweak the defaults for showing inline
information. Could you try the options I've suggested and see if they
remedy the issue for you?
Many thanks,
Ian
> Best regards,
> -Guilherme