Re: [PATCH v1 00/15] Introduce threaded trace streaming for basic perf record operation

From: Arnaldo Carvalho de Melo
Date: Tue Oct 13 2020 - 12:20:12 EST


Em Mon, Oct 12, 2020 at 11:50:29AM +0300, Alexey Budankov escreveu:
>
> Patch set provides threaded trace streaming for base perf record
> operation. Provided streaming mode (--threads) mitigates profiling
> data losses and resolves scalability issues of serial and asynchronous
> (--aio) trace streaming modes on multicore server systems. The patch
> set is based on the prototype [1], [2] and the most closely relates
> to mode 3) "mode that creates thread for every monitored memory map".
>
> The threaded mode executes one-to-one mapping of trace streaming threads
> to mapped data buffers and streaming into per-CPU trace files located
> at data directory. The data buffers and threads are affined to NUMA
> nodes and monitored CPUs according to system topology. --cpu option
> can be used to specify exact CPUs to be monitored.
>
> Basic analysis of data directories is provided for perf report mode.
> Raw dump (-D) and aggregated reports are available for data directories,
> still with no memory consumption optimizations. However data directories
> collected with --compression-level option enabled can be analyzed with
> little less memory because trace files are unmaped from tool process
> memory after loading collected data.
>
> Provided streaming mode is available with Zstd compression/decompression
> (--compression-level) and handling of external commands (--control).
> AUX area tracing, related and derived modes like --snapshot or
> --aux-sample are not enabled. --switch-output, --switch-output-event,
> --switch-max-files and --timestamp-filename options are not enabled.

Would be interesting to spell out what are the difficulties to have
those options working with this threaded mode, as I expect that once
this is all reviewed and tested we should switch to it by default, no?

- Arnaldo

> Threaded trace streaming is not enabled for pipe mode. Asynchronous
> (--aio) trace streaming and affinity (--affinity) modes are mutually
> exclusive to threaded streaming mode.
>
> See testing results and validation examples below.
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@xxxxxxxxxx/
>
> ---
> Alexey Budankov (15):
> perf session: introduce trace file path to be shown in raw trace dump
> perf report: output trace file name in raw trace dump
> perf data: open data directory in read access mode
> perf session: move reader object definition to header file
> perf session: introduce decompressor into trace reader object
> perf session: load data directory into tool process memory
> perf record: introduce trace file, compressor and stats in mmap object
> perf record: write trace data into mmap trace files
> perf record: introduce thread specific objects for trace streaming
> perf record: manage thread specific data array
> perf evlist: introduce evlist__ctlfd_update() to update ctl fd status
> perf record: introduce thread local variable for trace streaming
> perf record: stop threads in the end of trace streaming
> perf record: start threads in the beginning of trace streaming
> perf record: introduce --threads command line option
>
> tools/perf/Documentation/perf-record.txt | 7 +
> tools/perf/builtin-record.c | 514 +++++++++++++++++++++--
> tools/perf/util/data.c | 4 +
> tools/perf/util/evlist.c | 16 +
> tools/perf/util/evlist.h | 1 +
> tools/perf/util/mmap.c | 6 +
> tools/perf/util/mmap.h | 5 +
> tools/perf/util/ordered-events.h | 1 +
> tools/perf/util/record.h | 1 +
> tools/perf/util/session.c | 150 ++++---
> tools/perf/util/session.h | 28 ++
> tools/perf/util/tool.h | 3 +-
> 12 files changed, 635 insertions(+), 101 deletions(-)
>
> ---
> Testing results:
>
> $ perf test
> 1: vmlinux symtab matches kallsyms : Skip
> 2: Detect openat syscall event : Ok
> 3: Detect openat syscall event on all cpus : Ok
> 4: Read samples using the mmap interface : Ok
> 5: Test data source output : Ok
> 6: Parse event definition strings : Ok
> 7: Simple expression parser : Ok
> 8: PERF_RECORD_* events & perf_sample fields : Ok
> 9: Parse perf pmu format : Ok
> 10: PMU events :
> 10.1: PMU event table sanity : Ok
> 10.2: PMU event map aliases : Ok
> 10.3: Parsing of PMU event table metrics : Skip (some metrics failed)
> 10.4: Parsing of PMU event table metrics with fake PMUs : Ok
> 11: DSO data read : Ok
> 12: DSO data cache : Ok
> 13: DSO data reopen : FAILED!
> 14: Roundtrip evsel->name : Ok
> 15: Parse sched tracepoints fields : Ok
> 16: syscalls:sys_enter_openat event fields : Ok
> 17: Setup struct perf_event_attr : Ok
> 18: Match and link multiple hists : Ok
> 19: 'import perf' in python : FAILED!
> 20: Breakpoint overflow signal handler : Ok
> 21: Breakpoint overflow sampling : Ok
> 22: Breakpoint accounting : Ok
> 23: Watchpoint :
> 23.1: Read Only Watchpoint : Skip
> 23.2: Write Only Watchpoint : Ok
> 23.3: Read / Write Watchpoint : Ok
> 23.4: Modify Watchpoint : Ok
> 24: Number of exit events of a simple workload : Ok
> 25: Software clock events period values : Ok
> 26: Object code reading : Ok
> 27: Sample parsing : Ok
> 28: Use a dummy software event to keep tracking : Ok
> 29: Parse with no sample_id_all bit set : Ok
> 30: Filter hist entries : Ok
> 31: Lookup mmap thread : Ok
> 32: Share thread maps : Ok
> 33: Sort output of hist entries : Ok
> 34: Cumulate child hist entries : Ok
> 35: Track with sched_switch : Ok
> 36: Filter fds with revents mask in a fdarray : Ok
> 37: Add fd to a fdarray, making it autogrow : Ok
> 38: kmod_path__parse : Ok
> 39: Thread map : Ok
> 40: LLVM search and compile :
> 40.1: Basic BPF llvm compile : Skip
> 40.2: kbuild searching : Skip
> 40.3: Compile source for BPF prologue generation : Skip
> 40.4: Compile source for BPF relocation : Skip
> 41: Session topology : Ok
> 42: BPF filter :
> 42.1: Basic BPF filtering : Skip
> 42.2: BPF pinning : Skip
> 42.3: BPF prologue generation : Skip
> 42.4: BPF relocation checker : Skip
> 43: Synthesize thread map : Ok
> 44: Remove thread map : Ok
> 45: Synthesize cpu map : Ok
> 46: Synthesize stat config : Ok
> 47: Synthesize stat : Ok
> 48: Synthesize stat round : Ok
> 49: Synthesize attr update : Ok
> 50: Event times : Ok
> 51: Read backward ring buffer : Ok
> 52: Print cpu map : Ok
> 53: Merge cpu map : Ok
> 54: Probe SDT events : Ok
> 55: is_printable_array : Ok
> 56: Print bitmap : Ok
> 57: perf hooks : Ok
> 58: builtin clang support : Skip (not compiled in)
> 59: unit_number__scnprintf : Ok
> 60: mem2node : Ok
> 61: time utils : Ok
> 62: Test jit_write_elf : Ok
> 63: Test libpfm4 support : Skip (not compiled in)
> 64: Test api io : Ok
> 65: maps__merge_in : Ok
> 66: Demangle Java : Ok
> 67: Parse and process metrics : Ok
> 68: PE file support : Ok
> 69: Event expansion for cgroups : Ok
> 70: x86 rdpmc : Ok
> 71: Convert perf time to TSC : Ok
> 72: DWARF unwind : Ok
> 73: x86 instruction decoder - new instructions : Ok
> 74: Intel PT packet decoder : Ok
> 75: x86 bp modify : Ok
> 76: Check open filename arg using perf trace + vfs_getname : FAILED!
> 77: Add vfs_getname probe to get syscall args filenames : FAILED!
> 78: probe libc's inet_pton & backtrace it with ping : Ok
> 79: Use vfs_getname probe to get syscall args filenames : FAILED!
> 80: Zstd perf.data compression/decompression : Ok
> 81: Check Arm CoreSight trace data recording and synthesized samples: Skip
>
> Validation:
>
> $ tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
> Addr of buf1 = 0x7f8af24ae010
> Offs of buf1 = 0x7f8af24ae180
> Addr of buf2 = 0x7f8af04ad010
> Offs of buf2 = 0x7f8af04ad1c0
> Addr of buf3 = 0x7f8aee4ac010
> Offs of buf3 = 0x7f8aee4ac100
> Addr of buf4 = 0x7f8aec4ab010
> Offs of buf4 = 0x7f8aec4ab140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 24.579 seconds
> [ perf record: Woken up 972 times to write data ]
> [ perf record: Captured and wrote 11.280 MB perf.data (403 samples), compressed (original 66.670 MB, ratio is 5.926) ]
>
> $ perf report --header --stats
> # ========
> # captured on : Wed Oct 7 19:07:50 2020
> # header version : 1
> # data offset : 504
> # data size : 30056
> # feat offset : 30560
> # hostname : nntvtune39
> # os release : 5.8.8-100.fc31.x86_64
> # perf version : 4.13.rc5.gdf19b1347b82
> # arch : x86_64
> # nrcpus online : 8
> # nrcpus avail : 8
> # cpudesc : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
> # cpuid : GenuineIntel,6,94,3
> # total memory : 16119484 kB
> # cmdline : /root/abudanko/kernel/acme/tools/perf/perf record --threads --compression-level=3 -e cycles,instructions -- ../../bench /matrix/linux/matrix.gcc.g.O3
> # event : name = cycles, , id = { 54562, 54563, 54564, 54565, 54566, 54567, 54568, 54569 }, size = 120, { sample_period, sample_freq } = 4000, sample_type = IP|TID|TIME|ID|>
> # event : name = instructions, , id = { 54570, 54571, 54572, 54573, 54574, 54575, 54576, 54577 }, size = 120, config = 0x1, { sample_period, sample_freq } = 4000, sample_ty>
> # sibling sockets : 0-7
> # sibling dies : 0-7
> # sibling threads : 0,4
> # sibling threads : 1,5
> # sibling threads : 2,6
> # sibling threads : 3,7
> # CPU 0: Core ID 0, Die ID 0, Socket ID 0
> # CPU 1: Core ID 1, Die ID 0, Socket ID 0
> # CPU 2: Core ID 2, Die ID 0, Socket ID 0
> # CPU 3: Core ID 3, Die ID 0, Socket ID 0
> # CPU 4: Core ID 0, Die ID 0, Socket ID 0
> # CPU 5: Core ID 1, Die ID 0, Socket ID 0
> # CPU 6: Core ID 2, Die ID 0, Socket ID 0
> # CPU 7: Core ID 3, Die ID 0, Socket ID 0
> # node0 meminfo : total = 16119484 kB, free = 933088 kB
> # node0 cpu list : 0-7
> # pmu mappings: intel_pt = 8, software = 1, power = 19, uprobe = 7, uncore_imc = 11, cpu = 4, cstate_core = 17, uncore_cbox_2 = 14, breakpoint = 5, uncore_cbox_0 = 12, trac>
> # CPU cache info:
> # L1 Data 32K [0,4]
> # L1 Instruction 32K [0,4]
> # L1 Data 32K [1,5]
> # L1 Instruction 32K [1,5]
> # L1 Data 32K [1,5]
> # L1 Instruction 32K [1,5]
> # L1 Data 32K [2,6]
> # L1 Instruction 32K [2,6]
> # L1 Data 32K [3,7]
> # L1 Instruction 32K [3,7]
> # L2 Unified 256K [0,4]
> # L2 Unified 256K [1,5]
> # L2 Unified 256K [2,6]
> # L2 Unified 256K [3,7]
> # L3 Unified 8192K [0-7]
> # time of first sample : 0.000000
> # time of last sample : 0.000000
> # sample duration : 0.000 ms
> # memory nodes (nr 1, block size 0x8000000):
> # 0 [16G]: 0-23,32-135
> # directory data version : 1
> # bpf_prog_info 835: bpf_prog_6deef7357e7b4530 addr 0xffffffffc07c5138 size 66
> # bpf_prog_info 836: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0804558 size 66
> # bpf_prog_info 837: bpf_prog_84efc2eecc454ca6 addr 0xffffffffc080804c size 373
> # bpf_prog_info 838: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0817f50 size 66
> # bpf_prog_info 839: bpf_prog_6deef7357e7b4530 addr 0xffffffffc082aee8 size 66
> # bpf_prog_info 840: bpf_prog_5a2b06eab81b8f51 addr 0xffffffffc082c0fc size 1132
> # bpf_prog_info 841: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0aac380 size 62
> # bpf_prog_info 842: bpf_prog_8a642ed2819e9acc addr 0xffffffffc0d4c970 size 62
> # bpf_prog_info 843: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d4e8bc size 66
> # bpf_prog_info 844: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d5050c size 66
> # bpf_prog_info 845: bpf_prog_3ff695c0afc2579c addr 0xffffffffc0d56098 size 1552
> # bpf_prog_info 846: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d58364 size 66
> # bpf_prog_info 847: bpf_prog_6deef7357e7b4530 addr 0xffffffffc0d6ec00 size 66
> # compressed : Zstd, level = 3, ratio = 6
> # cpu pmu capabilities: branches=32, max_precise=3, pmu_name=skylake
> # missing features: TRACING_DATA BUILD_ID BRANCH_STACK GROUP_DESC AUXTRACE STAT CLOCKID CLOCK_DATA
> # ========
> #
> Aggregated stats:
> TOTAL events: 1457107
> MMAP events: 98
> LOST events: 0
> COMM events: 2
> EXIT events: 6
> THROTTLE events: 0
> UNTHROTTLE events: 0
> FORK events: 8
> READ events: 0
> SAMPLE events: 1456388
> MMAP2 events: 7
> AUX events: 0
> ITRACE_START events: 0
> LOST_SAMPLES events: 0
> SWITCH events: 0
> SWITCH_CPU_WIDE events: 0
> NAMESPACES events: 0
> KSYMBOL events: 13
> BPF_EVENT events: 13
> CGROUP events: 166
> TEXT_POKE events: 0
> ATTR events: 0
> EVENT_TYPE events: 0
> TRACING_DATA events: 0
> BUILD_ID events: 0
> FINISHED_ROUND events: 0
> ID_INDEX events: 0
> AUXTRACE_INFO events: 0
> AUXTRACE events: 0
> AUXTRACE_ERROR events: 0
> THREAD_MAP events: 1
> CPU_MAP events: 1
> STAT_CONFIG events: 0
> STAT events: 0
> STAT_ROUND events: 0
> EVENT_UPDATE events: 0
> TIME_CONV events: 1
> FEATURE events: 0
> COMPRESSED events: 403
>
> $ tools/perf/perf record --threads --compression-level=3 -g --call-graph dwarf,1024 -e cycles,instructions -- ../../bench/matrix/linux/matrix.gcc.g.O3
> Addr of buf1 = 0x7f15e8b89010
> Offs of buf1 = 0x7f15e8b89180
> Addr of buf2 = 0x7f15e6b88010
> Offs of buf2 = 0x7f15e6b881c0
> Addr of buf3 = 0x7f15e4b87010
> Offs of buf3 = 0x7f15e4b87100
> Addr of buf4 = 0x7f15e2b86010
> Offs of buf4 = 0x7f15e2b86140
> Threads #: 8 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 24.281 seconds
> [ perf record: Woken up 6923 times to write data ]
> [ perf record: Captured and wrote 34.933 MB perf.data (10414 samples), compressed (original 1728.795 MB, ratio is 49.530) ]
>
> $ perf report --stdio
> # To display the perf.data header info, please use --header/--header-only options.
> #
> #
> # Total Lost Samples: 0
> #
> # Samples: 708K of event 'cycles'
> # Event count (approx.): 277718834163
> #
> # Children Self Command Shared Object Symbol
> # ........ ........ ............... ........................ ........................................................
> #
> 99.97% 0.00% matrix.gcc.g.O3 libc-2.30.so [.] __GI___clone (inlined)
> |
> ---__GI___clone (inlined)
> |
> --99.97%--start_thread
> |
> --99.97%--ThreadFunction
> multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 99.97% 0.00% matrix.gcc.g.O3 libpthread-2.30.so [.] start_thread
> |
> ---start_thread
> |
> --99.97%--ThreadFunction
> |
> --99.97%--multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 99.97% 0.00% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] ThreadFunction
> |
> ---ThreadFunction
> |
> --99.97%--multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 99.97% 98.68% matrix.gcc.g.O3 matrix.gcc.g.O3 [.] multiply1
> |
> |--98.68%--__GI___clone (inlined)
> | start_thread
> | ThreadFunction
> | multiply1
> |
> --1.28%--multiply1
> |
> --0.96%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
> |
> --0.60%--__sysvec_apic_timer_interrupt
> |
> --0.57%--hrtimer_interrupt
>
> 0.97% 0.05% matrix.gcc.g.O3 [kernel.kallsyms] [k] asm_sysvec_apic_timer_interrupt
> |
> --0.91%--asm_sysvec_apic_timer_interrupt
> |
> --0.70%--subflow_syn_recv_sock
>
>
> --
> 2.24.1
>

--

- Arnaldo