[RFC PATCH v3 00/14] Share events between metrics

From: Ian Rogers
Date: Fri May 08 2020 - 01:36:39 EST


Metric groups contain metrics. Metrics create groups of events to
ideally be scheduled together. Often metrics refer to the same events,
for example, a cache hit and cache miss rate. Using separate event
groups means these metrics are multiplexed at different times and the
counts don't sum to 100%. More multiplexing also decreases the
accuracy of the measurement.
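
To illustrate why separately multiplexed groups can give rates that do not
sum to 100%, here is a minimal sketch. The slice counts and helper names
are hypothetical; the only assumption taken from perf is that a multiplexed
count is extrapolated as raw * time_enabled / time_running.

```python
# Hypothetical cache counts per time slice: (hits, misses).
# The workload changes behavior between slice 0 and slice 1.
SLICES = [(90, 10), (50, 50)]

def scaled_count(per_slice, running):
    """Extrapolate a multiplexed count: raw * time_enabled / time_running."""
    raw = sum(per_slice[i] for i in running)
    return raw * len(per_slice) / len(running)

hits = [h for h, _ in SLICES]
misses = [m for _, m in SLICES]
accesses = [h + m for h, m in SLICES]

def rate(numerator, running):
    """Rate computed from one event group that ran only in `running` slices."""
    return scaled_count(numerator, running) / scaled_count(accesses, running)

# Separate groups: the hit-rate group runs in slice 0, the miss-rate group in slice 1.
separate_sum = rate(hits, [0]) + rate(misses, [1])
# Shared group: both rates use counts taken during the same slice.
shared_sum = rate(hits, [0]) + rate(misses, [0])

print(f"separate groups: {separate_sum:.2f}")
print(f"shared group:    {shared_sum:.2f}")
```

With separate groups each rate samples a different phase of the workload,
so the two rates need not be complementary. With a shared group they are
computed from the same window.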

This change orders metrics from groups or the command line, so that
the ones with the most events are set up first. Later metrics see if
groups already provide their events, and reuse them if
possible. Unnecessary events and groups are eliminated.
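
The ordering and reuse described above can be modeled roughly as follows.
This is a simplified sketch, not the code in metricgroup.c: the function
name plan_groups, the subset test used for "already provide their events",
and the tie-break by name are all assumptions made for illustration. The
event sets are taken from the Sandybridge TopDownL1_SMT output in this
letter.

```python
T = "cpu_clk_unhalted.thread"
OTA = "cpu_clk_unhalted.one_thread_active"
XCLK = "cpu_clk_unhalted.ref_xclk"
IDQ = "idq_uops_not_delivered.core"
ISSUED = "uops_issued.any"
RETIRE = "uops_retired.retire_slots"
RECOV = "int_misc.recovery_cycles_any"

# Metric name -> events it needs (as listed in the --metric-no-merge output).
TOPDOWN_L1_SMT = {
    "Backend_Bound_SMT": {OTA, RECOV, ISSUED, IDQ, XCLK, T, RETIRE},
    "Bad_Speculation_SMT": {OTA, RECOV, ISSUED, XCLK, T, RETIRE},
    "Frontend_Bound_SMT": {OTA, XCLK, IDQ, T},
    "Retiring_SMT": {OTA, XCLK, RETIRE, T},
    "SLOTS_SMT": {OTA, XCLK, T},
}

def plan_groups(metrics):
    """Return (groups, assignment): the event groups to schedule and the
    index of the group each metric reads its events from."""
    groups = []
    assignment = {}
    # Largest metrics first, so later (smaller) metrics can find an
    # existing group that already contains all of their events.
    ordered = sorted(metrics.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for name, events in ordered:
        for idx, group in enumerate(groups):
            if events <= group:          # existing group provides every event
                assignment[name] = idx
                break
        else:
            groups.append(frozenset(events))   # no match: create a new group
            assignment[name] = len(groups) - 1
    return groups, assignment

groups, assignment = plan_groups(TOPDOWN_L1_SMT)
print(len(groups), "group(s),", len(groups[0]), "events in the first")
```

In this model all five metrics end up reading from the single seven-event
group created for Backend_Bound_SMT, which is consistent with the seven
events shown in the merged output.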

The option --metric-no-group is added so that metrics aren't placed in
groups. This affects multiplexing and may increase sharing.

The option --metric-no-merge is added; with this option the existing
grouping behavior is preserved.

RFC because:
- without this change, events within a metric may get scheduled
together; with it, they may appear as part of a larger group and be
multiplexed at different times, lowering accuracy. However, less
multiplexing may compensate for this.
- libbpf's hashmap is used, but libbpf is an optional requirement for
building perf.
- other things I'm not thinking of.

Thanks!

Example on Sandybridge:

$ perf stat -a --metric-no-merge -M TopDownL1_SMT sleep 1

Performance counter stats for 'system wide':

14931177 cpu_clk_unhalted.one_thread_active # 0.47 Backend_Bound_SMT (12.45%)
32314653 int_misc.recovery_cycles_any (16.23%)
555020905 uops_issued.any (18.85%)
1038651176 idq_uops_not_delivered.core (24.95%)
43003170 cpu_clk_unhalted.ref_xclk (25.20%)
1154926272 cpu_clk_unhalted.thread (31.50%)
656873544 uops_retired.retire_slots (31.11%)
16491988 cpu_clk_unhalted.one_thread_active # 0.06 Bad_Speculation_SMT (31.10%)
32064061 int_misc.recovery_cycles_any (31.04%)
648394934 uops_issued.any (31.14%)
42107506 cpu_clk_unhalted.ref_xclk (24.94%)
1124565282 cpu_clk_unhalted.thread (31.14%)
523430886 uops_retired.retire_slots (31.05%)
12328380 cpu_clk_unhalted.one_thread_active # 0.35 Frontend_Bound_SMT (10.08%)
42651836 cpu_clk_unhalted.ref_xclk (10.08%)
1006287722 idq_uops_not_delivered.core (10.08%)
1130593027 cpu_clk_unhalted.thread (10.08%)
14209258 cpu_clk_unhalted.one_thread_active # 0.18 Retiring_SMT (6.39%)
41904474 cpu_clk_unhalted.ref_xclk (6.39%)
522251584 uops_retired.retire_slots (6.39%)
1111257754 cpu_clk_unhalted.thread (6.39%)
12930094 cpu_clk_unhalted.one_thread_active # 2865823806.05 SLOTS_SMT (11.06%)
40975376 cpu_clk_unhalted.ref_xclk (11.06%)
1089204936 cpu_clk_unhalted.thread (11.06%)

1.002165509 seconds time elapsed

$ perf stat -a -M TopDownL1_SMT sleep 1

Performance counter stats for 'system wide':

11893411 cpu_clk_unhalted.one_thread_active # 2715516883.49 SLOTS_SMT
# 0.19 Retiring_SMT
# 0.33 Frontend_Bound_SMT
# 0.04 Bad_Speculation_SMT
# 0.44 Backend_Bound_SMT (71.46%)
28458253 int_misc.recovery_cycles_any (71.44%)
562710994 uops_issued.any (71.42%)
907105260 idq_uops_not_delivered.core (57.12%)
39797715 cpu_clk_unhalted.ref_xclk (57.12%)
1045357060 cpu_clk_unhalted.thread (71.41%)
504809283 uops_retired.retire_slots (71.44%)

1.001939294 seconds time elapsed

Note that without merging the metrics sum to 1.06, but with merging
the sum is 1.

Example on Cascadelake:

$ perf stat -a --metric-no-merge -M TopDownL1_SMT sleep 1

Performance counter stats for 'system wide':

13678949 cpu_clk_unhalted.one_thread_active # 0.59 Backend_Bound_SMT (13.35%)
121286613 int_misc.recovery_cycles_any (18.58%)
4041490966 uops_issued.any (18.81%)
2665605457 idq_uops_not_delivered.core (24.81%)
111757608 cpu_clk_unhalted.ref_xclk (25.03%)
7579026491 cpu_clk_unhalted.thread (31.27%)
3848429110 uops_retired.retire_slots (31.23%)
15554046 cpu_clk_unhalted.one_thread_active # 0.02 Bad_Speculation_SMT (31.19%)
119582342 int_misc.recovery_cycles_any (31.16%)
3813943706 uops_issued.any (31.14%)
113151605 cpu_clk_unhalted.ref_xclk (24.89%)
7621196102 cpu_clk_unhalted.thread (31.12%)
3735690253 uops_retired.retire_slots (31.12%)
13727352 cpu_clk_unhalted.one_thread_active # 0.16 Frontend_Bound_SMT (12.50%)
115441454 cpu_clk_unhalted.ref_xclk (12.50%)
2824946246 idq_uops_not_delivered.core (12.50%)
7817227775 cpu_clk_unhalted.thread (12.50%)
13267908 cpu_clk_unhalted.one_thread_active # 0.21 Retiring_SMT (6.31%)
114015605 cpu_clk_unhalted.ref_xclk (6.31%)
3722498773 uops_retired.retire_slots (6.31%)
7771438396 cpu_clk_unhalted.thread (6.31%)
14948307 cpu_clk_unhalted.one_thread_active # 18085611559.36 SLOTS_SMT (6.30%)
115632797 cpu_clk_unhalted.ref_xclk (6.30%)
8007628156 cpu_clk_unhalted.thread (6.30%)

1.006256703 seconds time elapsed

$ perf stat -a -M TopDownL1_SMT sleep 1

Performance counter stats for 'system wide':

35999534 cpu_clk_unhalted.one_thread_active # 25969550384.66 SLOTS_SMT
# 0.40 Retiring_SMT
# 0.14 Frontend_Bound_SMT
# 0.02 Bad_Speculation_SMT
# 0.44 Backend_Bound_SMT (71.35%)
133499018 int_misc.recovery_cycles_any (71.36%)
10736468874 uops_issued.any (71.40%)
3518076530 idq_uops_not_delivered.core (57.24%)
78296616 cpu_clk_unhalted.ref_xclk (57.25%)
8894997400 cpu_clk_unhalted.thread (71.50%)
10409738753 uops_retired.retire_slots (71.40%)

1.011611791 seconds time elapsed

Note that without merging the metrics sum to 0.98, but with merging
the sum is 1.
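
The sums quoted in the two notes can be checked from the reported metric
values. The snippet below only copies the four fractional metrics
(Backend_Bound_SMT, Bad_Speculation_SMT, Frontend_Bound_SMT, Retiring_SMT)
from the outputs above. SLOTS_SMT is left out because it is a slot count,
not a fraction.

```python
# Values in order: Backend_Bound, Bad_Speculation, Frontend_Bound, Retiring.
reported = {
    ("Sandybridge", "--metric-no-merge"): [0.47, 0.06, 0.35, 0.18],
    ("Sandybridge", "merged"): [0.44, 0.04, 0.33, 0.19],
    ("Cascadelake", "--metric-no-merge"): [0.59, 0.02, 0.16, 0.21],
    ("Cascadelake", "merged"): [0.44, 0.02, 0.14, 0.40],
}

# Round to two decimals, the precision perf stat printed.
totals = {key: round(sum(values), 2) for key, values in reported.items()}

for (machine, mode), total in totals.items():
    print(f"{machine:12s} {mode:18s} {total:.2f}")
```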

v3. is a rebase following the merging of patches in v2. It also
adds the --metric-no-group and --metric-no-merge flags.
v2. is the entire patch set based on acme's perf/core tree and includes
cherry-picks. Patch 13 was sent for review to the bpf maintainers here:
https://lore.kernel.org/lkml/20200506205257.8964-2-irogers@xxxxxxxxxx/
v1. was based on the perf metrics fixes and test sent here:
https://lore.kernel.org/lkml/20200501173333.227162-1-irogers@xxxxxxxxxx/

Andrii Nakryiko (1):
  libbpf: Fix memory leak and possible double-free in hashmap__clear

Ian Rogers (13):
  perf parse-events: expand add PMU error/verbose messages
  perf test: improve pmu event metric testing
  lib/bpf hashmap: increase portability
  perf expr: fix memory leaks in bison
  perf evsel: fix 2 memory leaks
  perf expr: migrate expr ids table to libbpf's hashmap
  perf metricgroup: change evlist_used to a bitmap
  perf metricgroup: free metric_events on error
  perf metricgroup: always place duration_time last
  perf metricgroup: delay events string creation
  perf metricgroup: order event groups by size
  perf metricgroup: remove duped metric group events
  perf metricgroup: add options to not group or merge

 tools/lib/bpf/hashmap.c                |   7 +
 tools/lib/bpf/hashmap.h                |   3 +-
 tools/perf/Documentation/perf-stat.txt |  19 ++
 tools/perf/arch/x86/util/intel-pt.c    |  32 +--
 tools/perf/builtin-stat.c              |  11 +-
 tools/perf/tests/builtin-test.c        |   5 +
 tools/perf/tests/expr.c                |  41 ++--
 tools/perf/tests/pmu-events.c          | 159 +++++++++++++-
 tools/perf/tests/pmu.c                 |   4 +-
 tools/perf/tests/tests.h               |   2 +
 tools/perf/util/evsel.c                |   2 +
 tools/perf/util/expr.c                 | 129 +++++++-----
 tools/perf/util/expr.h                 |  22 +-
 tools/perf/util/expr.y                 |  25 +--
 tools/perf/util/metricgroup.c          | 277 ++++++++++++++++---------
 tools/perf/util/metricgroup.h          |   6 +-
 tools/perf/util/parse-events.c         |  29 ++-
 tools/perf/util/pmu.c                  |  33 +--
 tools/perf/util/pmu.h                  |   2 +-
 tools/perf/util/stat-shadow.c          |  49 +++--
 tools/perf/util/stat.h                 |   2 +
 21 files changed, 592 insertions(+), 267 deletions(-)

--
2.26.2.645.ge9eca65c58-goog