[RFC PATCH v2 0/4] perf stat: Add option to aggregate data based on the cache topology

From: K Prateek Nayak
Date: Wed Apr 05 2023 - 13:09:35 EST


The motivation behind this feature is to aggregate data at the LLC level
on chiplet-based processors, which currently do not expose chiplet
details in the sysfs CPU topology information.

For completeness, the series adds the ability to aggregate data at any
cache level. The following is example output on a dual-socket Zen3
system (2 x 64C/128T) with 8 chiplets per socket.

$ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

Performance counter stats for 'system wide':

S0-D0-L3-ID0 16 4,463 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID1 16 2,962 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID2 16 2,592 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID3 16 2,508 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID4 16 1,841 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID5 16 1,764 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID6 16 1,205 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID7 16 5,806 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID8 16 1,461 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID9 16 648 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID10 16 1,443 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID11 16 1,333 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID12 16 1,167 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID13 16 640 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID14 16 601 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID15 16 3,423 ls_dmnd_fills_from_sys.ext_cache_remote

5.017954593 seconds time elapsed

The series also adds support for perf stat record and perf stat report
to aggregate data at various cache levels. The following is an example
of recording with aggregation at the L2 level and reporting the same
data with aggregation at the L3 level.

$ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

Performance counter stats for 'system wide':

S0-D0-L2-ID0 2 3,212 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID1 2 240 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID2 2 10 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID3 2 13 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID4 2 13 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID5 2 319 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID6 2 348 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID7 2 648 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID8 2 284 ls_dmnd_fills_from_sys.ext_cache_remote
...
S1-D1-L2-ID127 2 113 ls_dmnd_fills_from_sys.ext_cache_remote

5.017958787 seconds time elapsed

$ sudo perf stat report --per-cache=L3

Performance counter stats for '/home/amd/dev/linux/tools/perf/perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5':

S0-D0-L3-ID0 16 4,803 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID1 16 3,421 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID2 16 1,149 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID3 16 1,220 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID4 16 1,502 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID5 16 6,751 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID6 16 1,600 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID7 16 1,985 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID8 16 1,566 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID9 16 1,010 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID10 16 1,337 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID11 16 2,298 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID12 16 314 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID13 16 350 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID14 16 664 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID15 16 3,834 ls_dmnd_fills_from_sys.ext_cache_remote

5.017958787 seconds time elapsed

The sum of the aggregate at L2 from S0-D0-L2-ID0 to S0-D0-L2-ID7 is
equal to the value for S0-D0-L3-ID0 in perf stat report with aggregation
at L3 level since L3-ID0 contains L2-ID0 to L2-ID7 on the machine.
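
As a quick sanity check of the claim above, the eight L2 values shown for
S0-D0-L2-ID0 through S0-D0-L2-ID7 in the record output can be summed and
compared with the S0-D0-L3-ID0 value from the report output. The numbers
below are copied from the outputs above; the variable names are only
illustrative.

```python
# Counts for S0-D0-L2-ID0 .. S0-D0-L2-ID7, copied from the
# "perf stat record --per-cache=L2" output above.
l2_counts = [3212, 240, 10, 13, 13, 319, 348, 648]

# Count for S0-D0-L3-ID0, copied from the
# "perf stat report --per-cache=L3" output above.
l3_id0_count = 4803

total = sum(l2_counts)
print(total)
assert total == l3_id0_count
```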

[New in v2]
On a kernel that does not expose the cache instance ID in sysfs, the
cache ID is set to 0. Running perf stat gives the following output
on the same system with the cache instance ID hidden:

$ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- sleep 5

Performance counter stats for 'system wide':

S0-D0-L3-ID0 128 13,277 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID0 128 9,822 ls_dmnd_fills_from_sys.ext_cache_remote

5.020718145 seconds time elapsed
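
The fallback described above can be sketched roughly as follows. This is
Python for brevity, not the C code in the series, and the function name
read_cache_id and its sysfs_root parameter are hypothetical. It assumes the
ID lives in the per-index "id" file under each CPU's cache directory in
sysfs, and returns 0 when that file is missing or unreadable.

```python
from pathlib import Path


def read_cache_id(cpu, index, sysfs_root="/sys/devices/system/cpu"):
    """Return the cache instance ID for one cache index of a CPU.

    Falls back to 0 when the "id" file is absent or cannot be parsed,
    mirroring the v2 behaviour described in the cover letter.
    """
    id_path = Path(sysfs_root) / f"cpu{cpu}" / "cache" / f"index{index}" / "id"
    try:
        return int(id_path.read_text().strip())
    except (OSError, ValueError):
        # File missing, unreadable, or not a number: treat the ID as 0.
        return 0
```

With every ID collapsing to 0, all CPUs in a socket land in the same
aggregation bucket, which is consistent with the 128-CPU rows shown above.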

This series makes a breaking change to how the cache details of env are
saved for recording and reporting purposes. If there is a better way to
do this, please do let me know.

Following points were not considered when designing this RFC:

- Handling multiple cache types at the same level: For example, consider
a case where L1i is thread-local but L1d is core-wide. The logic
currently selects the last cache instance it sees at a particular
level when iterating over the indices. This may lead to the user
expecting a different result than the one perf reports.

- For the same example as above, where L1i is thread-local and L1d is
core-wide, record and report might not give consistent results, as
the qsort() function used to sort cache_level_data[] when saving the
env data is not guaranteed to be stable and might not preserve the
order of different caches at the same level. Since we use the last
entry seen at a given level, the unstable sort might lead to
inconsistencies.
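
One possible direction for both points, sketched in Python for brevity and
not what the series currently does: order the cache entries by a total key
(level, then type) so that the entry chosen for a level no longer depends
on iteration order or on sort stability. The CacheEntry tuple, the
pick_per_level helper, and the choice of letting the alphabetically last
type win are all assumptions made purely for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for one cache entry; not the perf structure.
CacheEntry = namedtuple("CacheEntry", ["level", "type", "cache_id"])


def pick_per_level(entries):
    """Pick one entry per cache level, independent of the input order."""
    chosen = {}
    # (level, type) is a total order as long as no two entries share both
    # fields, so a stable and an unstable sort produce the same sequence.
    for entry in sorted(entries, key=lambda e: (e.level, e.type)):
        # Later entries overwrite earlier ones: "last one seen wins".
        chosen[entry.level] = entry
    return chosen


entries = [
    CacheEntry(level=1, type="Data", cache_id=0),
    CacheEntry(level=1, type="Instruction", cache_id=3),
    CacheEntry(level=2, type="Unified", cache_id=0),
]
# Both input orders select the same L1 entry.
assert pick_per_level(entries) == pick_per_level(list(reversed(entries)))
```

In C, the equivalent would be a qsort() comparator that breaks ties on the
cache type instead of returning 0 for entries at the same level.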

I'm seeking some clarification from the community for the above problems
and potential solutions for processors where all CPUs might not share
the same topology structure.

This series applies cleanly on top of the perf-tools branch from Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at:

commit e8d018dd0257 ("Linux 6.3-rc3")

--
Changelog:
o v1->v2
- Set cache instance ID to 0 if the file cannot be read.
- Fix cache level parsing function.
- Updated details in cover letter.
--
K Prateek Nayak (4):
perf: Read cache instance ID when building cache topology
perf: Save cache instance ID when saving cache topology data
perf: Extract building cache level for a CPU into separate function
perf: Add option for --per-cache aggregation

tools/lib/perf/include/perf/cpumap.h | 5 +
tools/lib/perf/include/perf/event.h | 3 +-
tools/perf/Documentation/perf-stat.txt | 16 ++
tools/perf/builtin-stat.c | 149 +++++++++++++++++-
.../tests/shell/lib/perf_json_output_lint.py | 4 +-
tools/perf/tests/shell/stat+csv_output.sh | 14 ++
tools/perf/tests/shell/stat+json_output.sh | 13 ++
tools/perf/util/cpumap.c | 97 ++++++++++++
tools/perf/util/cpumap.h | 17 ++
tools/perf/util/env.h | 1 +
tools/perf/util/event.c | 7 +-
tools/perf/util/header.c | 77 ++++++---
tools/perf/util/header.h | 4 +
tools/perf/util/stat-display.c | 16 ++
tools/perf/util/stat-shadow.c | 1 +
tools/perf/util/stat.h | 2 +
tools/perf/util/synthetic-events.c | 1 +
17 files changed, 395 insertions(+), 32 deletions(-)

--
2.34.1