Re: [PATCH v4 0/5] perf stat: Add option to aggregate data based on the cache topology

From: Ian Rogers
Date: Wed May 17 2023 - 13:58:19 EST


On Wed, May 17, 2023 at 10:22 AM K Prateek Nayak <kprateek.nayak@xxxxxxx> wrote:
>
> The motivation behind this feature is to aggregate the data at the LLC level
> for chiplet-based processors, which currently do not expose the chiplet
> details in the sysfs CPU topology information.
>
> For completeness, the series adds the ability to
> aggregate data at any cache level. The following is an example of the
> output on a dual-socket Zen3 processor with 2 x 64C/128T containing 8
> chiplets per socket.
>
> $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- \
> taskset -c 0-15,64-79,128-143,192-207 \
> perf bench sched messaging -p -t -l 100000 -g 8
>
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 8 groups == 320 threads run
>
> Total time: 7.648 [sec]
>
> Performance counter stats for 'system wide':
>
> S0-D0-L3-ID0 16 17,145,912 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID8 16 14,977,628 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID16 16 262,539 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID24 16 3,140 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID32 16 27,403 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID40 16 17,026 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID48 16 7,292 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID56 16 2,464 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID64 16 22,489,306 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID72 16 21,455,257 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID80 16 11,619 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID88 16 30,978 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID96 16 37,628 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID104 16 13,594 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID112 16 10,164 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID120 16 11,259 ls_dmnd_fills_from_sys.ext_cache_remote
>
> 7.779171484 seconds time elapsed
>
> The series also adds support for perf stat record and perf stat report
> to aggregate data at various cache levels. The following is an example of
> recording with aggregation at the L2 level and reporting the same data with
> aggregation at the L3 level.
>
> $ sudo perf stat record --per-cache=L2 -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- \
> taskset -c 0-15,64-79,128-143,192-207 \
> perf bench sched messaging -p -t -l 100000 -g 8
>
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 8 groups == 320 threads run
>
> Total time: 7.318 [sec]
>
> Performance counter stats for 'system wide':
>
> S0-D0-L2-ID0 2 2,171,980 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID1 2 2,048,494 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID2 2 2,120,293 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID3 2 2,224,725 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID4 2 2,021,618 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID5 2 1,995,331 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID6 2 2,163,029 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID7 2 2,104,623 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L2-ID8 2 1,948,776 ls_dmnd_fills_from_sys.ext_cache_remote
> ...
> S0-D0-L2-ID63 2 2,648 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID64 2 2,963,323 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID65 2 2,856,629 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID66 2 2,901,725 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID67 2 3,046,120 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID68 2 2,637,971 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID69 2 2,680,029 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID70 2 2,672,259 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID71 2 2,638,768 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID72 2 3,308,642 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID73 2 3,064,473 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID74 2 3,023,379 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID75 2 2,975,119 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID76 2 2,952,677 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID77 2 2,981,695 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID78 2 3,455,916 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID79 2 2,959,540 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L2-ID80 2 4,977 ls_dmnd_fills_from_sys.ext_cache_remote
> ...
> S1-D1-L2-ID127 2 3,359 ls_dmnd_fills_from_sys.ext_cache_remote
>
> 7.451725897 seconds time elapsed
>
> $ sudo perf stat report --per-cache=L3
>
> Performance counter stats for '...':
>
> S0-D0-L3-ID0 16 16,850,093 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID8 16 16,001,493 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID16 16 301,011 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID24 16 26,276 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID32 16 48,958 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID40 16 43,799 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID48 16 16,771 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID56 16 12,544 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID64 16 22,396,824 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID72 16 24,721,441 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID80 16 29,426 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID88 16 54,348 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID96 16 41,557 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID104 16 10,084 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID112 16 14,361 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID120 16 24,446 ls_dmnd_fills_from_sys.ext_cache_remote
>
> 7.451725897 seconds time elapsed
>
> The aggregate at S0-D0-L3-ID0 is the sum of S0-D0-L2-ID0 to S0-D0-L2-ID7,
> as the L3 containing CPU0 contains the L2 instances of CPU0 to CPU7.
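
As a quick sanity check of the statement above, the eight L2 counts listed for S0-D0-L2-ID0 through S0-D0-L2-ID7 can be summed and compared with the S0-D0-L3-ID0 line of the report. The values are copied from the quoted output; the variable names are only illustrative.

```python
# Counts for S0-D0-L2-ID0 .. S0-D0-L2-ID7, copied from the
# `perf stat record --per-cache=L2` output quoted above.
l2_counts = [
    2_171_980, 2_048_494, 2_120_293, 2_224_725,
    2_021_618, 1_995_331, 2_163_029, 2_104_623,
]

# Count shown for S0-D0-L3-ID0 by `perf stat report --per-cache=L3`.
l3_id0 = 16_850_093

l2_total = sum(l2_counts)
print(l2_total, l2_total == l3_id0)  # prints "16850093 True"
```
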
>
> Cache IDs are derived from the shared_cpu_list file in the cache
> topology. This allows for --per-cache aggregation of data on a kernel
> which does not expose the cache instance ID in sysfs. Running perf
> stat gives the following output on the same system with the cache
> instance ID hidden:
>
> $ ls /sys/devices/system/cpu/cpu0/cache/index0/
>
> coherency_line_size level number_of_sets physical_line_partition
> shared_cpu_list shared_cpu_map size type uevent
> ways_of_associativity
>
> $ sudo perf stat --per-cache -a -e ls_dmnd_fills_from_sys.ext_cache_remote -- \
> taskset -c 0-15,64-79,128-143,192-207 \
> perf bench sched messaging -p -t -l 100000 -g 8
>
> # Running 'sched/messaging' benchmark:
> # 20 sender and receiver threads per group
> # 8 groups == 320 threads run
>
> Total time: 6.949 [sec]
>
> Performance counter stats for 'system wide':
>
> S0-D0-L3-ID0 16 5,297,615 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID8 16 4,347,868 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID16 16 416,593 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID24 16 4,346 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID32 16 5,506 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID40 16 15,845 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID48 16 24,164 ls_dmnd_fills_from_sys.ext_cache_remote
> S0-D0-L3-ID56 16 4,543 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID64 16 41,610,374 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID72 16 38,393,688 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID80 16 22,188 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID88 16 22,918 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID96 16 39,230 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID104 16 6,236 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID112 16 66,846 ls_dmnd_fills_from_sys.ext_cache_remote
> S1-D1-L3-ID120 16 72,713 ls_dmnd_fills_from_sys.ext_cache_remote
>
> 7.098471410 seconds time elapsed
>
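
The ID derivation described above can be sketched roughly as follows. This is not the code from the series (which is C in tools/perf/util/cpumap.c); it is a minimal Python illustration that assumes the lowest-numbered CPU in shared_cpu_list stands in for the "first online CPU" of the cache domain. The function names and the sample strings "0-7,128-135" and "8-15,136-143" are hypothetical, chosen only to be consistent with 16 CPUs per L3 in the output.

```python
def parse_cpu_list(text):
    """Expand a sysfs-style CPU list such as "0-7,128-135" into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty input or stray comma
        lo, _, hi = part.partition("-")
        # A single CPU ("5") has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sorted(cpus)


def cache_id_from_shared_cpu_list(text):
    """Return the lowest CPU in the list as the cache ID, or -1 if the list is empty.

    Assumption: the lowest-numbered CPU represents the first online CPU
    of the cache domain.
    """
    cpus = parse_cpu_list(text)
    return cpus[0] if cpus else -1


# Hypothetical contents of .../cpuN/cache/index3/shared_cpu_list
print(cache_id_from_shared_cpu_list("0-7,128-135\n"))
print(cache_id_from_shared_cpu_list("8-15,136-143\n"))
```

With these sample inputs the derived IDs are 0 and 8, which lines up with the S0-D0-L3-ID0 and S0-D0-L3-ID8 labels in the output.
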
> A few notes:
>
> - This series makes a breaking change when saving the aggregation details,
> as the cache level needs to be saved along with the aggregation
> method.
>
> - This series assumes that caches at the same level are shared by the same
> set of threads. The implementation will run into an issue if, say, L1i
> is thread-local but L1d is shared by the SMT siblings on the core.
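
To illustrate the second note, here is a small sketch of how one might detect a system that violates the same-level assumption. It is not part of the series; the function name, the tuple layout, and the sample values are hypothetical, and the shared_cpu_list strings are compared verbatim rather than parsed.

```python
from collections import defaultdict


def levels_with_mixed_sharing(caches):
    """Return the cache levels whose instances do not all share the same CPUs.

    `caches` is an iterable of (level, cache_type, shared_cpu_list) tuples
    describing the cache indices of a single CPU (hypothetical layout).
    """
    sharing_by_level = defaultdict(set)
    for level, _cache_type, shared_cpu_list in caches:
        sharing_by_level[level].add(shared_cpu_list.strip())
    # More than one distinct CPU list at a level breaks the assumption.
    return sorted(lvl for lvl, lists in sharing_by_level.items() if len(lists) > 1)


# Hypothetical CPU0: L1i is thread-local while L1d is shared with the SMT sibling.
cpu0_caches = [
    (1, "Data", "0,128"),
    (1, "Instruction", "0"),
    (2, "Unified", "0,128"),
    (3, "Unified", "0-7,128-135"),
]
print(levels_with_mixed_sharing(cpu0_caches))  # prints "[1]"
```
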
>
> This series applies cleanly on top of the perf-tools branch from Arnaldo's tree
> (https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/log/?h=perf-tools)
> at commit 760ebc45746b ("perf lock contention: Add empty 'struct rq' to
> satisfy libbpf 'runqueue' type verification")
> ---
> Changelog:
> o v3->v4:
> - Dropped the RFC tag.
> - Broke down Patch 2 from v3 into smaller patches (kind of!)
> - Fixed a couple of errors in docs and comments.
>
> o v2->v3:
> - Dropped patches 1 and 2 that saved and retrieved the cache instance
> ID when saving the cache data.
> - The above is unnecessary as the IDs are being derived from the first
> online CPU in the cache domain for a given cache instance.
> - Improvements to handling cases where a cache level is not present
> but the level is allowed by MAX_CACHE_LVL.
> - Updated details in cover letter.
>
> o v1->v2
> - Set cache instance ID to 0 if the file cannot be read.
> - Fix cache level parsing function.
> - Updated details in cover letter.
> ---
> K Prateek Nayak (5):
> perf: Extract building cache level for a CPU into separate function
> perf stat: Setup the foundation to allow aggregation based on cache
> topology
> perf stat: Save cache level information when running perf stat record
> perf stat: Add "--per-cache" aggregation option and document the same
> perf stat: Add tests for the "--per-cache" option

Acked-by: Ian Rogers <irogers@xxxxxxxxxx>

Thanks,
Ian

> tools/lib/perf/include/perf/cpumap.h | 5 +
> tools/lib/perf/include/perf/event.h | 3 +-
> tools/perf/Documentation/perf-stat.txt | 16 ++
> tools/perf/builtin-stat.c | 144 +++++++++++++++++-
> .../tests/shell/lib/perf_json_output_lint.py | 4 +-
> tools/perf/tests/shell/stat+csv_output.sh | 14 ++
> tools/perf/tests/shell/stat+json_output.sh | 13 ++
> tools/perf/util/cpumap.c | 119 +++++++++++++++
> tools/perf/util/cpumap.h | 28 ++++
> tools/perf/util/event.c | 7 +-
> tools/perf/util/header.c | 62 +++++---
> tools/perf/util/header.h | 4 +
> tools/perf/util/stat-display.c | 17 +++
> tools/perf/util/stat.h | 2 +
> tools/perf/util/synthetic-events.c | 1 +
> 15 files changed, 409 insertions(+), 30 deletions(-)
>
> --
> 2.34.1
>