Re: [RFCv2 00/48] perf tools: Add threads to record command
From: Alexey Budankov
Date: Fri Sep 21 2018 - 08:16:07 EST
Hello Jiri,
On 21.09.2018 9:13, Alexey Budankov wrote:
> Hello Jiri,
>
> On 14.09.2018 12:37, Alexey Budankov wrote:
>> On 14.09.2018 11:28, Jiri Olsa wrote:
>>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote:
>>>
>>> SNIP
>>>
>>>>>> The threaded monitoring currently can't monitor backward maps
>>>>>> and there are probably more limitations which I haven't spotted
>>>>>> yet.
>>>>>>
>>>>>> So far I tested on laptop:
>>>>>> http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt
>>>>>>
>>>>>> and a one bigger server:
>>>>>> http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt
>>>>>>
>>>>>> I can see decrease in recorded LOST events, but both the benchmark
>>>>>> and the monitoring must be carefully configured wrt:
>>>>>> - number of events (frequency)
>>>>>> - size of the memory maps
>>>>>> - size of events (callchains)
>>>>>> - final perf.data size
>>>>>>
>>>>>> It's also available in:
>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>>>>>> perf/record_threads
>>>>>>
>>>>>> thoughts? ;-) thanks
>>>>>> jirka
>>>>>
>>>>> It is preferable to split into smaller pieces that bring
>>>>> some improvement proved by metrics numbers and ready for
>>>>> merging and upstream. Do we have more metrics than the
>>>>> data loss from trace AIO patches?
>>>>
>>>> well the primary focus is to get more events in,
>>>> so the LOST metric is the main one
>>>
>>> actualy I was hoping, could you please run it through the same
>>> tests as you do for AIO code on some huge server?
>>
>> Yeah, I will, but it takes some time.
>
> Here it is:
>
> Hardware:
> cat /proc/cpuinfo
> processor : 271
> vendor_id : GenuineIntel
> cpu family : 6
> model : 133
> model name : Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz
> stepping : 0
> microcode : 0xe
> cpu MHz : 1064.235
> cache size : 1024 KB
> physical id : 0
> siblings : 272
> core id : 73
> cpu cores : 68
> apicid : 295
> initial apicid : 295
> fpu : yes
> fpu_exception : yes
> cpuid level : 13
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts avx512_vpopcntdq avx512_4vnniw avx512_4fmaps
> bugs : cpu_meltdown spectre_v1 spectre_v2
> bogomips : 2594.07
> clflush size : 64
> cache_alignment : 64
> address sizes : 46 bits physical, 48 bits virtual
> power management:
>
> uname -a
> Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64 x86_64 x86_64 GNU/Linux
>
> cat /proc/sys/kernel/perf_event_paranoid
> 0
>
> cat /proc/sys/kernel/perf_event_mlock_kb
> 516
>
> cat /proc/sys/kernel/perf_event_max_sample_rate
> 3000
>
> cat /etc/redhat-release
> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>
> Metrics:
> runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
> data loss (%) : paused_time / elapsed_time_under_profiling
> LOST events : stat from perf report --stats
> SAMPLE events : stat from perf report --stats
> perf.data size (B) : size of trace file on disk
>
> Events:
> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>
> =================================================
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> /usr/bin/time ./matrix.icc
> Addr of buf1 = 0x7ff9faa73010
> Offs of buf1 = 0x7ff9faa73180
> Addr of buf2 = 0x7ff9f8a72010
> Offs of buf2 = 0x7ff9f8a721c0
> Addr of buf3 = 0x7ff9f6a71010
> Offs of buf3 = 0x7ff9f6a71100
> Addr of buf4 = 0x7ff9f4a70010
> Offs of buf4 = 0x7ff9f4a70140
> Threads #: 256 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Freq = 0.997720 GHz
> Execution time = 9.061 seconds
> 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata 100448maxresident)k
> 96inputs+0outputs (1major+33839minor)pagefaults 0swaps
>
> T : 272
> P (period, ms) : 0.1
> runtime overhead (%) : 45x ~ 323.54 / 7.12
> data loss (%) : 96
> LOST events : 323662
> SAMPLE events : 31885479
> perf.data size (GiB) : 42
>
> P (period, ms) : 0.25
> runtime overhead (%) : 25x ~ 180.76 / 7.12
> data loss (%) : 69
> LOST events : 10636
> SAMPLE events : 18692998
> perf.data size (GiB) : 23.5
>
> P (period, ms) : 0.35
> runtime overhead (%) : 16x ~ 119.49 / 7.12
> data loss (%) : 1
> LOST events : 6
> SAMPLE events : 11178524
> perf.data size (GiB) : 14
>
> T : 128
> P (period, ms) : 0.35
> runtime overhead (%) : 15x ~ 111.98 / 7.12
> data loss (%) : 62
> LOST events : 2825
> SAMPLE events : 11267247
> perf.data size (GiB) : 15
>
> T : 64
> P (period, ms) : 0.35
> runtime overhead (%) : 14x ~ 101.55 / 7.12
> data loss (%) : 67
> LOST events : 5155
> SAMPLE events : 10966297
> perf.data size (GiB) : 13.7
>
> Workload: matrix multiplication in 128 threads
>
> /usr/bin/time ./matrix.gcc
> Addr of buf1 = 0x7f072e630010
> Offs of buf1 = 0x7f072e630180
> Addr of buf2 = 0x7f072c62f010
> Offs of buf2 = 0x7f072c62f1c0
> Addr of buf3 = 0x7f072a62e010
> Offs of buf3 = 0x7f072a62e100
> Addr of buf4 = 0x7f072862d010
> Offs of buf4 = 0x7f072862d140
> Threads #: 128 Pthreads
> Matrix size: 2048
> Using multiply kernel: multiply1
> Execution time = 6.639 seconds
> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata 100756maxresident)k
> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
>
> T : 272
> P (period, ms) : 0.1
> runtime overhead (%) : 29x ~ 198.81 / 6.81
> data loss (%) : 21
> LOST events : 2502
> SAMPLE events : 22481062
> perf.data size (GiB) : 27.6
>
> P (period, ms) : 0.25
> runtime overhead (%) : 13x ~ 88.47 / 6.81
> data loss (%) : 0
> LOST events : 0
> SAMPLE events : 9572787
> perf.data size (GiB) : 11.3
>
> P (period, ms) : 0.35
> runtime overhead (%) : 10x ~ 67.11 / 6.81
> data loss (%) : 1
> LOST events : 137
> SAMPLE events : 6985930
> perf.data size (GiB) : 8
>
> T : 128
> P (period, ms) : 0.35
> runtime overhead (%) : 9.5x ~ 64.33 / 6.81
> data loss (%) : 1
> LOST events : 3
> SAMPLE events : 6666903
> perf.data size (GiB) : 7.8
>
> T : 64
> P (period, ms) : 0.25
> runtime overhead (%) : 17x ~ 114.27 / 6.81
> data loss (%) : 2
> LOST events : 52
> SAMPLE events : 12643645
> perf.data size (GiB) : 15.5
>
> P (period, ms) : 0.35
> runtime overhead (%) : 10x ~ 68.60 / 6.81
> data loss (%) : 1
> LOST events : 93
> SAMPLE events : 7164368
> perf.data size (GiB) : 8.5
and this is for AIO and serial:
Command:
/usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \
-a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
-e cpu/period=P,event=0x3c/Duk,\
cpu/period=P,umask=0x3/Duk,\
cpu/period=P,event=0xc0/Duk,\
cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
--clockid=monotonic_raw -- ./matrix.(icc|gcc)
Workload: matrix multiplication in 256 threads
N : 512
P (period, ms) : 2.5
runtime overhead (%) : 2.7x ~ 19.21 / 7.12
data loss (%) : 42
LOST events : 1600
SAMPLE events : 1235928
perf.data size (GiB) : 1.5
N : 272
P (period, ms) : 1.5
runtime overhead (%) : 2.5x ~ 18.09 / 7.12
data loss (%) : 89
LOST events : 3457
SAMPLE events : 1222143
perf.data size (GiB) : 1.5
P (period, ms) : 2
runtime overhead (%) : 2.5x ~ 17.93 / 7.12
data loss (%) : 65
LOST events : 2496
SAMPLE events : 1240754
perf.data size (GiB) : 1.5
P (period, ms) : 2.5
runtime overhead (%) : 2.5x ~ 17.87 / 7.12
data loss (%) : 44
LOST events : 1621
SAMPLE events : 1221949
perf.data size (GiB) : 1.5
P (period, ms) : 3
runtime overhead (%) : 2.5x ~ 18.43 / 7.12
data loss (%) : 12
LOST events : 350
SAMPLE events : 1117972
perf.data size (GiB) : 1.3
N : 128
P (period, ms) : 3
runtime overhead (%) : 2.4x ~ 17.08 / 7.12
data loss (%) : 11
LOST events : 335
SAMPLE events : 1116832
perf.data size (GiB) : 1.3
N : 64
P (period, ms) : 3
runtime overhead (%) : 2.2x ~ 16.03 / 7.12
data loss (%) : 11
LOST events : 329
SAMPLE events : 1108205
perf.data size (GiB) : 1.3
Workload: matrix multiplication in 128 threads
N : 512
P (period, ms) : 1
runtime overhead (%) : 3.5x ~ 23.72 / 6.81
data loss (%) : 18
LOST events : 1043
SAMPLE events : 2015306
perf.data size (GiB) : 2.3
N : 272
P (period, ms) : 0.5
runtime overhead (%) : 3x ~ 22.72 / 6.81
data loss (%) : 90
LOST events : 5842
SAMPLE events : 2205937
perf.data size (GiB) : 2.5
P (period, ms) : 1
runtime overhead (%) : 3x ~ 22.79 / 6.81
data loss (%) : 11
LOST events : 481
SAMPLE events : 2017099
perf.data size (GiB) : 2.5
P (period, ms) : 1.5
runtime overhead (%) : 3x ~ 19.93 / 6.81
data loss (%) : 5
LOST events : 190
SAMPLE events : 1308692
perf.data size (GiB) : 1.5
P (period, ms) : 2
runtime overhead (%) : 3x ~ 18.95 / 6.81
data loss (%) : 0
LOST events : 0
SAMPLE events : 1010769
perf.data size (GiB) : 1.2
N : 128
P (period, ms) : 1.5
runtime overhead (%) : 3x ~ 19.08 / 6.81
data loss (%) : 6
LOST events : 220
SAMPLE events : 1322240
perf.data size (GiB) : 1.5
N : 64
P (period, ms) : 1.5
runtime overhead (%) : 3x ~ 19.43 / 6.81
data loss (%) : 3
LOST events : 130
SAMPLE events : 1386521
perf.data size (GiB) : 1.6
=================================================
Command:
/usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \
-a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
-e cpu/period=P,event=0x3c/Duk,\
cpu/period=P,umask=0x3/Duk,\
cpu/period=P,event=0xc0/Duk,\
cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
--clockid=monotonic_raw -- ./matrix.(icc|gcc)
Workload: matrix multiplication in 256 threads
P (period, ms) : 7.5
runtime overhead (%) : 1.6x ~ 11.6 / 7.12
data loss (%) : 1
LOST events : 1
SAMPLE events : 451062
perf.data size (GiB) : 0.5
Workload: matrix multiplication in 128 threads
P (period, ms) : 3
runtime overhead (%) : 1.8x ~ 12.58 / 6.81
data loss (%) : 9
LOST events : 147
SAMPLE events : 673299
perf.data size (GiB) : 0.8
Thanks,
Alexey
>
> Thanks,
> Alexey
>
>>
>>>
>>> thanks,
>>> jirka
>>>
>>
>