[RFC PATCH v3 0/2] perf stat: Add option to aggregate data based on the cache topology

From: K Prateek Nayak
Date: Thu Apr 13 2023 - 02:20:58 EST


Motivation behind this feature is to aggregate the data at the LLC level
for chiplet based processors which currently do not expose the chiplet
details in sysfs cpu topology information.

For the completeness of the feature, the series adds ability to
aggregate data at any cache level. Following is the example of the
output on a dual socket Zen3 processor with 2 x 64C/128T containing 8
chiplet per socket.

$ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
taskset -c 0-15,64-79,128-143,192-207\
perf bench sched messaging -p -t -l 100000 -g 8

# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 8 groups == 320 threads run

Total time: 7.648 [sec]

Performance counter stats for 'system wide':

S0-D0-L3-ID0 16 17,145,912 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID8 16 14,977,628 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID16 16 262,539 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID24 16 3,140 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID32 16 27,403 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID40 16 17,026 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID48 16 7,292 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID56 16 2,464 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID64 16 22,489,306 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID72 16 21,455,257 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID80 16 11,619 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID88 16 30,978 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID96 16 37,628 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID104 16 13,594 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID112 16 10,164 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID120 16 11,259 ls_dmnd_fills_from_sys.ext_cache_remote

7.779171484 seconds time elapsed

The series also adds support for perf stat record and perf stat report
to aggregate data at various cache levels. Following is an example of
recording with aggregation at L2 level and reporting the same data with
aggregation at L3 level.

$ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
taskset -c 0-15,64-79,128-143,192-207\
perf bench sched messaging -p -t -l 100000 -g 8

# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 8 groups == 320 threads run

Total time: 7.318 [sec]

Performance counter stats for 'system wide':

S0-D0-L2-ID0 2 2,171,980 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID1 2 2,048,494 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID2 2 2,120,293 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID3 2 2,224,725 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID4 2 2,021,618 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID5 2 1,995,331 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID6 2 2,163,029 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID7 2 2,104,623 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L2-ID8 2 1,948,776 ls_dmnd_fills_from_sys.ext_cache_remote
...
S0-D0-L2-ID63 2 2,648 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID64 2 2,963,323 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID65 2 2,856,629 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID66 2 2,901,725 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID67 2 3,046,120 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID68 2 2,637,971 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID69 2 2,680,029 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID70 2 2,672,259 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID71 2 2,638,768 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID72 2 3,308,642 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID73 2 3,064,473 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID74 2 3,023,379 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID75 2 2,975,119 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID76 2 2,952,677 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID77 2 2,981,695 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID78 2 3,455,916 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID79 2 2,959,540 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L2-ID80 2 4,977 ls_dmnd_fills_from_sys.ext_cache_remote
...
S1-D1-L2-ID127 2 3,359 ls_dmnd_fills_from_sys.ext_cache_remote

7.451725897 seconds time elapsed

$ sudo perf stat report --per-cache=L3

Performance counter stats for '...':

S0-D0-L3-ID0 16 16,850,093 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID8 16 16,001,493 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID16 16 301,011 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID24 16 26,276 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID32 16 48,958 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID40 16 43,799 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID48 16 16,771 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID56 16 12,544 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID64 16 22,396,824 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID72 16 24,721,441 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID80 16 29,426 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID88 16 54,348 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID96 16 41,557 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID104 16 10,084 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID112 16 14,361 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID120 16 24,446 ls_dmnd_fills_from_sys.ext_cache_remote

7.451725897 seconds time elapsed

The aggregate at S0-D0-L3-ID0 is the sum of S0-D0-L2-ID0 to S0-D0-L3-ID7
as L3 containing CPU0 contains the L2 instance of CPU0 to CPU7.

[New in v3 - Handling IDs differently compared to v2]
Cache IDs are now derived from the shared_cpus_list file in the cache
topology. This allows for --per-cache aggregation of data on a kernel
which does not expose the cache instance ID in the sysfs. Running perf
stat will give the following output on the same system with cache
instance ID hidden:

$ ls /sys/devices/system/cpu/cpu0/cache/index0/

coherency_line_size level number_of_sets physical_line_partition
shared_cpu_list shared_cpu_map size type uevent
ways_of_associativity

$ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote --\
taskset -c 0-15,64-79,128-143,192-207\
perf bench sched messaging -p -t -l 100000 -g 8

# Running 'sched/messaging' benchmark:
# 20 sender and receiver threads per group
# 8 groups == 320 threads run

Total time: 6.949 [sec]

Performance counter stats for 'system wide':

S0-D0-L3-ID0 16 5,297,615 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID8 16 4,347,868 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID16 16 416,593 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID24 16 4,346 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID32 16 5,506 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID40 16 15,845 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID48 16 24,164 ls_dmnd_fills_from_sys.ext_cache_remote
S0-D0-L3-ID56 16 4,543 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID64 16 41,610,374 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID72 16 38,393,688 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID80 16 22,188 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID88 16 22,918 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID96 16 39,230 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID104 16 6,236 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID112 16 66,846 ls_dmnd_fills_from_sys.ext_cache_remote
S1-D1-L3-ID120 16 72,713 ls_dmnd_fills_from_sys.ext_cache_remote

7.098471410 seconds time elapsed

This series makes breaking change when saving the aggregation details as
the cache level needs to be saved along with the aggregation method.

This RFC assumes that caches at same level will be shared by same set of
threads. The implementation will run into an issue if, say L1i is thread
local, but L1d is shared by the SMT siblings on the core. I'm seeking
clarification from the community about the same and potential solutions
if processors where such a scenario exist.

This series cleanly applies on top perf-tool branch from Arnaldo's tree
(https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
at:

commit e8d018dd0257 ("Linux 6.3-rc3")

--
Changelog:
o v2->v3:
- Dropped patches 1 and 2 that saved and retrieved the cache instance
ID when saving the cache data.
- The above is unnecessary as the IDs are being derived from the first
online CPU in the cache domain for a given cache instance.
- Improvements to handling cases where a cache level is not present
but the level is allowed by MAX_CACHE_LVL.
- Updated details in cover letter.

o v1->v2
- Set cache instance ID to 0 if the file cannot be read.
- Fix cache level parsing function.
- Updated details in cover letter.
--
K Prateek Nayak (2):
perf: Extract building cache level for a CPU into separate function
perf: Add option for --per-cache aggregation

tools/lib/perf/include/perf/cpumap.h | 5 +
tools/lib/perf/include/perf/event.h | 3 +-
tools/perf/Documentation/perf-stat.txt | 16 ++
tools/perf/builtin-stat.c | 144 +++++++++++++++++-
.../tests/shell/lib/perf_json_output_lint.py | 4 +-
tools/perf/tests/shell/stat+csv_output.sh | 14 ++
tools/perf/tests/shell/stat+json_output.sh | 13 ++
tools/perf/util/cpumap.c | 118 ++++++++++++++
tools/perf/util/cpumap.h | 28 ++++
tools/perf/util/event.c | 7 +-
tools/perf/util/header.c | 62 +++++---
tools/perf/util/header.h | 4 +
tools/perf/util/stat-display.c | 17 +++
tools/perf/util/stat-shadow.c | 1 +
tools/perf/util/stat.h | 2 +
tools/perf/util/synthetic-events.c | 1 +
16 files changed, 409 insertions(+), 30 deletions(-)

--
2.34.1