Re: [PATCH v10 00/24] Introduce threaded trace streaming for basic perf record operation
From: Riccardo Mancini
Date: Wed Aug 11 2021 - 13:45:35 EST
Hi,
I was wondering whether you should call perf_set_multithreaded() and
perf_set_singlethreaded() in your patchset, as is done for synthetic
events.
AFAICT, they enable/disable certain locks and semaphores.
Apart from this doubt, I ran some experiments on my laptop and on a VM, running
stress-ng as workload and using perf-stat to gather information about the
threaded perf-record.
Here are some results:
- the number of missed chunks decreases (sampling at a high frequency, ~1kHz, with
dwarf call graphs causes some misses with a single thread but none with multiple threads);
- the overall utilization of perf-record does not significantly change;
- the total size of the perf.data does not significantly change.
I also noticed that the --threads=socket option does not work as expected on my
VM: it creates 16 threads. Anyway, it looks like a problem (or feature) of the
/sys fs and not a bug in this patchset, since every cpu looks like the following:
$ grep "" /sys/devices/system/cpu/cpu0/topology/*_list
/sys/devices/system/cpu/cpu0/topology/core_siblings_list:0
/sys/devices/system/cpu/cpu0/topology/thread_siblings_list:0
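
To illustrate why this topology would yield one thread per cpu, here is a minimal sketch. It assumes (this is not taken from the patchset) that a "socket" layout creates one thread per distinct core_siblings_list value, and that the VM exposes 16 cpus, which would match the 16 threads observed. The function name and the sample dictionaries are hypothetical.

```python
def count_socket_groups(core_siblings):
    """Count distinct sibling groups.

    core_siblings maps a cpu number to the text of its
    topology/core_siblings_list file (hypothetical input format).
    """
    return len(set(core_siblings.values()))


# VM case: every cpu lists only itself, as in the grep output above.
vm = {cpu: str(cpu) for cpu in range(16)}

# Assumed single-socket host: every cpu lists the same sibling range.
host = {cpu: "0-15" for cpu in range(16)}

vm_groups = count_socket_groups(vm)
host_groups = count_socket_groups(host)
print(vm_groups, host_groups)
```

With the VM-style input each cpu forms its own group, so a per-socket layout degenerates into a per-cpu layout.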
Apart from the small issues mentioned above:
Reviewed-by: Riccardo Mancini <rickyman7@xxxxxxxxx>
Tested-by: Riccardo Mancini <rickyman7@xxxxxxxxx>
Thanks,
Riccardo
On Mon, 2021-07-12 at 09:46 +0300, Alexey Bayduraev wrote:
> Changes in v10:
> - renamed fdarray__clone to fdarray__dup_entry_from
> - captured Acked-by: tags by Namhyung Kim for 09/24
>
> v9:
> https://lore.kernel.org/lkml/cover.1625227739.git.alexey.v.bayduraev@xxxxxxxxxxxxxxx/
>
> Changes in v9:
> - fixes in [v9 01/24]:
> - move 'nr_threads' to before 'thread_masks'
> - combined decl+assign into one line in record__thread_mask_alloc
> - releasing masks inplace in record__alloc_thread_masks
> - split patch [v8 02/22] to [v9 02/24] and [v9 03/24]
> - fixes in [v9 03/24]:
> - renamed 'struct thread_data' to 'struct record_thread'
> - moved nr_mmaps after ctlfd_pos
> - releasing resources inplace in record__thread_data_init_maps
> - initializing pipes by -1 value
> - added temporary gettid() wrapper
> - split patch [v8 03/22] to [v9 04/24] and [v9 05/24]
> - removed upstreamed [v8 09/22]
> - split [v8 10/22] to [v9 12/24] and [v9 13/24]
> - moved --threads documentation to the related patches
> - fixed output of written/compressed stats in [v9 10/24]
> - split patch [v8 12/22] to [v9 15/24] and [v9 16/24]
> - fixed order of error checking for decompressed events in [v9 16/24]
> - merged patch [v8 21/22] with [v9 23/24] and [v9 24/24]
> - moved patch [v8 22/22] to [v9 09/24]
> - added max reader size constant in [v9 24/24]
>
> v8:
> https://lore.kernel.org/lkml/cover.1625065643.git.alexey.v.bayduraev@xxxxxxxxxxxxxxx/
>
> Changes in v8:
> - captured Acked-by: tags by Namhyung Kim
> - merged with origin/perf/core
> - added patch 21/22 introducing READER_NODATA state
> - added patch 22/22 fixing --max-size option
>
> v7:
> https://lore.kernel.org/lkml/cover.1624350588.git.alexey.v.bayduraev@xxxxxxxxxxxxxxx/
>
> Changes in v7:
> - fixed possible crash after out_free_threads label
> - added missing pthread_attr_destroy() call
> - added check of correctness of user masks
> - fixed zsts_data finalization
>
> v6:
> https://lore.kernel.org/lkml/cover.1622025774.git.alexey.v.bayduraev@xxxxxxxxxxxxxxx/
>
> Changes in v6:
> - fixed leaks and possible double free in record__thread_mask_alloc()
> - fixed leaks in record__init_thread_user_masks()
> - fixed final mmaps flushing for threads id > 0
> - merged with origin/perf/core
>
> v5:
> https://lore.kernel.org/lkml/cover.1619781188.git.alexey.v.bayduraev@xxxxxxxxxxxxxxx/
>
> Changes in v5:
> - fixed leaks in record__init_thread_masks_spec()
> - fixed leaks after failed realloc
> - replaced "%m" to strerror()
> - added masks examples to the documentation
> - captured Acked-by: tags by Andi Kleen
> - do not allow --thread option for full_auxtrace mode
> - split patch 06/12 to 06/20 and 07/20
> - split patch 08/12 to 09/20 and 10/20
> - split patches 11/12 and 11/12 to 13/20-20/20
>
> v4:
> https://lore.kernel.org/lkml/6c15adcb-6a9d-320e-70b5-957c4c8b6ff2@xxxxxxxxxxxxxxx/
>
> Changes in v4:
> - renamed 'comm' structure to 'pipes'
> - moved thread fd/maps messages to verbose=2
> - fixed leaks during allocation of thread_data structures
> - fixed leaks during allocation of thread masks
> - fixed possible fails when releasing thread masks
>
> v3:
> https://lore.kernel.org/lkml/7d197a2d-56e2-896d-bf96-6de0a4db1fb8@xxxxxxxxxxxxxxx/
>
> Changes in v3:
> - avoided skipped redundant patch 3/15
> - applied "data file" and "data directory" terms allover the patch set
> - captured Acked-by: tags by Namhyung Kim
> - avoided braces where don't needed
> - employed thread local variable for serial trace streaming
> - added specs for --thread option - core, socket, numa and user defined
> - added parallel loading of data directory files similar to the prototype [1]
>
> v2:
> https://lore.kernel.org/lkml/1ec29ed6-0047-d22f-630b-a7f5ccee96b4@xxxxxxxxxxxxxxx/
>
> Changes in v2:
> - explicitly added credit tags to patches 6/15 and 15/15,
> additionally to cites [1], [2]
> - updated description of 3/15 to explicitly mention the reason
> to open data directories in read access mode (e.g. for perf report)
> - implemented fix for compilation error of 2/15
> - explicitly elaborated on found issues to be resolved for
> threaded AUX trace capture
>
> v1:
> https://lore.kernel.org/lkml/810f3a69-0004-9dff-a911-b7ff97220ae0@xxxxxxxxxxxxxxx/
>
> Patch set provides parallel threaded trace streaming mode for basic
> perf record operation. Provided mode mitigates profiling data losses
> and resolves scalability issues of serial and asynchronous (--aio)
> trace streaming modes on multicore server systems. The design and
> implementation are based on the prototype [1], [2].
>
> Parallel threaded mode executes trace streaming threads that read kernel
> data buffers and write captured data into several data files located at
> data directory. Layout of trace streaming threads and their mapping to data
> buffers to read can be configured using a value of --thread command line
> option. Specification value provides masks separated by colon so the masks
> define cpus to be monitored by one thread and thread affinity mask is
> separated by slash. <cpus mask 1>/<affinity mask 1>:<cpu mask 2>/<affinity mask 2>
> specifies parallel threads layout that consists of two threads with
> corresponding assigned cpus to be monitored. Specification value can be
> a string e.g. "cpu", "core" or "socket" meaning creation of data streaming
> thread for monitoring every cpu, whole core or socket. The option provided
> with no or empty value defaults to "cpu" layout creating data streaming
> thread for every cpu being monitored. Specification masks are filtered
> by the mask provided via -C option.
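
For reference, this is how I read the spec format described above, as a minimal sketch and not the parser from the patchset. The function names are hypothetical. It assumes masks are cpu lists such as "0-3" or "0,4", and that -C filtering applies to the cpus masks only; what happens to a mask that ends up empty is not stated in the cover letter, so the sketch simply keeps it.

```python
def parse_cpu_list(text):
    """Parse a cpu list such as "0-3" or "0,4,2,6" into a set of ints."""
    cpus = set()
    for part in text.split(","):
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def parse_threads_spec(spec, monitored=None):
    """Split "<cpus>/<affinity>:<cpus>/<affinity>" into per-thread masks.

    monitored, if given, plays the role of the -C mask (assumption:
    only the cpus masks are filtered, not the affinity masks).
    """
    layout = []
    for entry in spec.split(":"):
        cpus_text, affinity_text = entry.split("/", 1)
        cpus = parse_cpu_list(cpus_text)
        if monitored is not None:
            cpus &= monitored
        layout.append((cpus, parse_cpu_list(affinity_text)))
    return layout


print(parse_threads_spec("0-3/3:4-7/4"))
print(parse_threads_spec("0-3/3:4-7/4", monitored={2, 5}))
```

Under this reading, "0-3/3:4-7/4" gives two threads, and adding -C 2,5 leaves each thread with a single monitored cpu.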
>
> Parallel streaming mode is compatible with Zstd compression/decompression
> (--compression-level) and external control commands (--control). The mode
> is not enabled for pipe mode. The mode is not enabled for AUX area tracing,
> related and derived modes like --snapshot or --aux-sample. --switch-output-*
> and --timestamp-filename options are not enabled for parallel streaming.
> Initial intent to enable AUX area tracing faced the need to define some
> optimal way to store index data in data directory. --switch-output-* and
> --timestamp-filename use cases are not clear for data directories.
> Asynchronous(--aio) trace streaming and affinity (--affinity) modes are
> mutually exclusive to parallel streaming mode.
>
> Basic analysis of data directories is provided in perf report mode.
> Raw dump and aggregated reports are available for data directories,
> still with no memory consumption optimizations.
>
> Tested:
>
> tools/perf/perf record -o prof.data --threads -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads= -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=cpu -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=core -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=socket -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=numa -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 2,5 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 3,4 --threads=0-3/3:4-7/4 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=core -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data -C 0,4,2,6 --threads=numa -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads -g --call-graph dwarf,4096 --compression-level=3 -- matrix.gcc.g.O3
> tools/perf/perf record -o prof.data --threads -a
> tools/perf/perf record -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
> tools/perf/perf record --threads -D -1 -e cpu-cycles -a --control fd:10,11 -- sleep 30
>
> tools/perf/perf report -i prof.data
> tools/perf/perf report -i prof.data --call-graph=callee
> tools/perf/perf report -i prof.data --stdio --header
> tools/perf/perf report -i prof.data -D --header
>
> [1] git clone https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git -b perf/record_threads
> [2] https://lore.kernel.org/lkml/20180913125450.21342-1-jolsa@xxxxxxxxxx/
>
> Alexey Bayduraev (24):
> perf record: Introduce thread affinity and mmap masks
> tools lib: Introduce fdarray duplicate function
> perf record: Introduce thread specific data array
> perf record: Introduce function to propagate control commands
> perf record: Introduce thread local variable
> perf record: Stop threads in the end of trace streaming
> perf record: Start threads in the beginning of trace streaming
> perf record: Introduce data file at mmap buffer object
> perf record: Introduce bytes written stats to support --max-size option
> perf record: Introduce data transferred and compressed stats
> perf record: Init data file at mmap buffer object
> perf record: Introduce --threads command line option
> perf record: Extend --threads command line option
> perf record: Implement compatibility checks
> perf report: Output non-zero offset for decompressed records
> perf report: Output data file name in raw trace dump
> perf session: Move reader structure to the top
> perf session: Introduce reader_state in reader object
> perf session: Introduce reader objects in session object
> perf session: Introduce decompressor into trace reader object
> perf session: Move init into reader__init function
> perf session: Move map/unmap into reader__mmap function
> perf session: Load single file for analysis
> perf session: Load data directory files for analysis
>
> tools/lib/api/fd/array.c | 17 +
> tools/lib/api/fd/array.h | 1 +
> tools/perf/Documentation/perf-record.txt | 30 +
> tools/perf/builtin-inject.c | 3 +-
> tools/perf/builtin-kvm.c | 2 +-
> tools/perf/builtin-record.c | 1197 ++++++++++++++++++++--
> tools/perf/builtin-top.c | 2 +-
> tools/perf/builtin-trace.c | 2 +-
> tools/perf/util/evlist.c | 16 +
> tools/perf/util/evlist.h | 1 +
> tools/perf/util/mmap.c | 6 +
> tools/perf/util/mmap.h | 6 +
> tools/perf/util/ordered-events.c | 3 +-
> tools/perf/util/ordered-events.h | 3 +-
> tools/perf/util/record.h | 2 +
> tools/perf/util/session.c | 506 ++++++---
> tools/perf/util/session.h | 7 +-
> tools/perf/util/tool.h | 3 +-
> 18 files changed, 1593 insertions(+), 214 deletions(-)
>