[RFC PATCH 0/4] perf stat: Add option to aggregate data based on the cache topology

From: K Prateek Nayak
Date: Fri Mar 31 2023 - 00:52:00 EST


The motivation behind this feature is to aggregate data at the LLC level
on chiplet-based processors, which currently do not expose chiplet
details in the sysfs CPU topology information.

For completeness, the series adds the ability to aggregate data at any
cache level. The following is example output from a dual-socket Zen3
system (2 x 64C/128T) with 8 chiplets per socket.

$ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

Performance counter stats for 'system wide':

S0-D0-L3-ID0     16     4,463  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID1     16     2,962  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID2     16     2,592  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID3     16     2,508  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID4     16     1,841  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID5     16     1,764  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID6     16     1,205  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID7     16     5,806  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID8     16     1,461  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID9     16       648  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID10    16     1,443  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID11    16     1,333  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID12    16     1,167  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID13    16       640  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID14    16       601  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID15    16     3,423  ls_dmnd_fills_from_sys.ext_cache_remote

5.017954593 seconds time elapsed
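
Each row above is keyed by a per-level cache instance ID. As a rough
illustration of where such an ID can come from, the sketch below looks up
the `level` and `id` attributes under the sysfs cacheinfo directories
(`cpuN/cache/indexM/`). The helper names, the `root` parameter, and the
upper bound of 16 index directories are assumptions made for this example
and are not taken from the patches.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer from a file. Returns 0 on success, -1 on failure. */
static int read_int_file(const char *path, int *val)
{
	FILE *fp = fopen(path, "r");
	int ok;

	if (!fp)
		return -1;

	ok = fscanf(fp, "%d", val) == 1;
	fclose(fp);

	return ok ? 0 : -1;
}

/*
 * Hypothetical helper: return the cache instance ID of @cpu at @level,
 * or -1 if it cannot be determined. @root is normally
 * "/sys/devices/system/cpu". If several caches share the level (for
 * example L1i and L1d), the last matching index wins.
 */
static int cache_id_for_cpu(const char *root, int cpu, int level)
{
	char path[256];
	int id = -1;

	assert(root);

	for (int idx = 0; idx < 16; idx++) {	/* 16 is an assumed bound */
		int lvl, cur;

		snprintf(path, sizeof(path), "%s/cpu%d/cache/index%d/level",
			 root, cpu, idx);
		if (read_int_file(path, &lvl))
			break;	/* no further index directories */

		if (lvl != level)
			continue;

		snprintf(path, sizeof(path), "%s/cpu%d/cache/index%d/id",
			 root, cpu, idx);
		if (!read_int_file(path, &cur))
			id = cur;
	}

	return id;
}
```

Calling `cache_id_for_cpu("/sys/devices/system/cpu", 0, 3)` would return
the L3 instance ID for CPU 0 on systems that expose the `id` attribute,
and -1 otherwise.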

The series also adds support for aggregating data at various cache
levels in perf stat record and perf stat report. The following example
records with aggregation at the L2 level and reports the same data with
aggregation at the L3 level.

$ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

Performance counter stats for 'system wide':

S0-D0-L2-ID0      2     3,212  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID1      2       240  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID2      2        10  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID3      2        13  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID4      2        13  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID5      2       319  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID6      2       348  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID7      2       648  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID8      2       284  ls_dmnd_fills_from_sys.ext_cache_remote
...
S1-D1-L2-ID127    2       113  ls_dmnd_fills_from_sys.ext_cache_remote

5.017958787 seconds time elapsed

$ sudo perf stat report --per-cache=L3

Performance counter stats for '/home/amd/dev/linux/tools/perf/perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5':

S0-D0-L3-ID0     16     4,803  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID1     16     3,421  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID2     16     1,149  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID3     16     1,220  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID4     16     1,502  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID5     16     6,751  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID6     16     1,600  ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID7     16     1,985  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID8     16     1,566  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID9     16     1,010  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID10    16     1,337  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID11    16     2,298  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID12    16       314  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID13    16       350  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID14    16       664  ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID15    16     3,834  ls_dmnd_fills_from_sys.ext_cache_remote

5.017958787 seconds time elapsed

The sum of the L2 aggregates from S0-D0-L2-ID0 to S0-D0-L2-ID7 equals
the value reported for S0-D0-L3-ID0 by perf stat report with aggregation
at the L3 level, since L3-ID0 contains L2-ID0 to L2-ID7 on this machine.
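
The roll-up can be checked by hand with the L2 values printed above. The
sketch below sums per-L2 counts into their parent L3 instance. The helper
name and the assumption that each L3 instance covers a fixed number of
consecutive L2 IDs are illustrative only; they happen to match the
machine described here and are not how perf derives the mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts for S0-D0-L2-ID0..ID7, copied from the record output above. */
static const uint64_t l2_counts[] = {
	3212, 240, 10, 13, 13, 319, 348, 648,
};

/*
 * Hypothetical helper: sum the counts of all L2 instances that fall
 * under L3 instance @l3_id, assuming @l2_per_l3 consecutive L2 IDs
 * per L3 instance.
 */
static uint64_t rollup_l2_to_l3(const uint64_t *counts, size_t nr,
				unsigned int l2_per_l3, unsigned int l3_id)
{
	uint64_t sum = 0;

	assert(l2_per_l3 > 0);

	for (size_t i = 0; i < nr; i++) {
		if (i / l2_per_l3 == l3_id)
			sum += counts[i];
	}

	return sum;
}
```

With 8 L2 instances per L3, `rollup_l2_to_l3(l2_counts, 8, 8, 0)` yields
4803, which matches the 4,803 shown for S0-D0-L3-ID0 in the report.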

This series makes a breaking change to how the cache details in env are
saved for recording and reporting. If there is a better way to do this,
please let me know.

The following points were not considered when designing this RFC:

- Handling multiple cache types at the same level, for example L1i and
L1d, both of which are at level 1. The current implementation
retrieves the instance ID from the last entry in cache_level_data[]
with the matching level. This works as long as L1i and L1d cover the
same set of CPUs, but will not work for an exotic cache topology.

- If the processor features an exotic cache topology with different
types of caches at the same level covering different sets of CPUs,
record and report might not give consistent results, because the
qsort() function used to sort cache_level_data[] when saving the env
data is not guaranteed to be stable and might not preserve the
relative order of the different caches at the same level.
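
One possible way to sidestep the qsort() ordering concern is to make the
comparator a total order, so that no two distinct entries compare equal
and stability no longer matters. The following is a minimal sketch under
stated assumptions: struct cache_entry is a simplified, hypothetical
stand-in for a cache_level_data[] entry, and level, type, and ID are
assumed to identify a cache instance uniquely. It is not what the series
currently does.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified stand-in for one cache_level_data[] entry. */
struct cache_entry {
	unsigned int level;	/* 1, 2, 3, ... */
	const char *type;	/* e.g. "Data", "Instruction", "Unified" */
	int id;			/* cache instance ID */
};

/* Total order: by level, then type, then ID. */
static int cache_entry_cmp(const void *a, const void *b)
{
	const struct cache_entry *ca = a, *cb = b;
	int ret;

	if (ca->level != cb->level)
		return ca->level < cb->level ? -1 : 1;

	ret = strcmp(ca->type, cb->type);
	if (ret)
		return ret;

	/* Avoid subtraction so large IDs cannot overflow. */
	return (ca->id > cb->id) - (ca->id < cb->id);
}

/* Sort entries into a deterministic order regardless of input order. */
static void sort_cache_entries(struct cache_entry *entries, size_t nr)
{
	assert(entries || nr == 0);
	qsort(entries, nr, sizeof(*entries), cache_entry_cmp);
}
```

With such a comparator, L1d would always sort before L1i at level 1, so
record and report would see the same order. Whether tie-breaking on type
is the right policy for picking the instance ID is a separate question.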

I'm seeking feedback from the community on the above problems and on
potential solutions for processors where not all CPUs share the same
topology structure.

This series applies cleanly on top of the perf-tools branch from
Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at:

commit e8d018dd0257 ("Linux 6.3-rc3")

--
K Prateek Nayak (4):
perf: Read cache instance ID when building cache topology
perf: Save cache instance ID when saving cache topology data
perf: Extract building cache level for a CPU into separate function
perf: Add option for --per-cache aggregation

tools/lib/perf/include/perf/cpumap.h | 5 +
tools/lib/perf/include/perf/event.h | 3 +-
tools/perf/Documentation/perf-stat.txt | 16 ++
tools/perf/builtin-stat.c | 149 +++++++++++++++++-
.../tests/shell/lib/perf_json_output_lint.py | 4 +-
tools/perf/tests/shell/stat+csv_output.sh | 14 ++
tools/perf/tests/shell/stat+json_output.sh | 13 ++
tools/perf/util/cpumap.c | 97 ++++++++++++
tools/perf/util/cpumap.h | 17 ++
tools/perf/util/env.h | 1 +
tools/perf/util/event.c | 7 +-
tools/perf/util/header.c | 77 ++++++---
tools/perf/util/header.h | 4 +
tools/perf/util/stat-display.c | 16 ++
tools/perf/util/stat-shadow.c | 1 +
tools/perf/util/stat.h | 2 +
tools/perf/util/synthetic-events.c | 1 +
17 files changed, 395 insertions(+), 32 deletions(-)

--
2.34.1