[PATCHSET 0/9] perf report: Improve srcline sort performance (v1)

From: Namhyung Kim
Date: Thu Dec 15 2022 - 14:28:28 EST


Hello,

I noticed a performance problem in the srcline/srcfile processing during
perf report when it's using an external addr2line process. I guess it's
also helpful even if it uses the libbfd to get the srcline info.

Also note that it's mostly from large (static) binaries, but smaller
binaries should also benefit from the fix if they have a lot of samples.

The first 5 patches are general fixes and updates. The latter 4 patches
implemented the actual speed-up.

Let's test it with the perf tools itself. Build a static binary like below.

$ cd tools/perf
$ make NO_JVMTI=1 LDFLAGS=-static

Then run the perf test workload.

$ ./perf record -- ./perf test -w noploop

And run the perf report with srcline sort key like this.

$ ./perf report -n -s srcline --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 4K of event 'cycles:u'
# Event count (approx.): 3572938596
#
# Overhead Samples Source:Line
# ........ ............ ............
#
99.94% 4010 noploop.c:26
0.03% 14 ??:0
0.03% 1 perf.c:330
0.00% 1 wcscpy.o:0

The problem is that it runs the addr2line when it processes each sample.
But as you can see many samples can have same result. IOW, if the samples
have same address, we don't need to run the addr2line each time.

So I changed the sort_key->cmp() to compare the addresses only and moved
the addr2line from sort_key->collapse() so that they can be run after
merging the samples with the same address.

With the change, I can get a huge speed-up in processing srcline info
while they generate the same output.

Before:

$ ./perf stat -- ./perf report -s srcline > /dev/null

Performance counter stats for './perf report -s srcline':

15,397.13 msec task-clock:u # 0.993 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
3,810 page-faults:u # 247.449 /sec
54,516,351,820 cycles:u # 3.541 GHz
31,494,118,293 instructions:u # 0.58 insn per cycle
8,577,271,187 branches:u # 557.069 M/sec
1,216,165,520 branch-misses:u # 14.18% of all branches

15.505066606 seconds time elapsed

15.094122000 seconds user
0.396962000 seconds sys

After:

$ ./perf stat -- ./perf report -s srcline > /dev/null

Performance counter stats for './perf report -s srcline':

105.66 msec task-clock:u # 0.994 CPUs utilized
0 context-switches:u # 0.000 /sec
0 cpu-migrations:u # 0.000 /sec
3,275 page-faults:u # 30.995 K/sec
185,063,407 cycles:u # 1.751 GHz
142,470,215 instructions:u # 0.77 insn per cycle
34,584,038 branches:u # 327.311 M/sec
3,226,005 branch-misses:u # 9.33% of all branches

0.106270464 seconds time elapsed

0.074254000 seconds user
0.032871000 seconds sys

The code is available at 'perf/srcline-v1' branch in

git://git.kernel.org/pub/scm/linux/kernel/git/namhyung/linux-perf.git

Thanks,
Namhyung


Namhyung Kim (9):
perf srcline: Do not return NULL for srcline
perf report: Ignore SIGPIPE for srcline
perf symbol: Add filename__has_section()
perf srcline: Skip srcline if .debug_line is missing
perf srcline: Conditionally suppress addr2line warnings
perf hist: Add perf_hpp_fmt->init() callback
perf hist: Improve srcline sort key performance
perf hist: Improve srcfile sort key performance
perf hist: Improve srcline_{from,to} sort key performance

tools/perf/builtin-report.c | 1 +
tools/perf/util/hist.c | 10 +--
tools/perf/util/hist.h | 1 +
tools/perf/util/sort.c | 129 ++++++++++++++++++++++++++++---
tools/perf/util/sort.h | 1 +
tools/perf/util/srcline.c | 20 +++--
tools/perf/util/symbol-elf.c | 28 +++++++
tools/perf/util/symbol-minimal.c | 5 ++
tools/perf/util/symbol.h | 1 +
9 files changed, 176 insertions(+), 20 deletions(-)


base-commit: 818448e9cf92e5c6b3c10320372eefcbe4174e4f
--
2.39.0.314.g84b9a713c41-goog