Re: [RFCv2 00/48] perf tools: Add threads to record command

From: Alexey Budankov
Date: Mon Sep 24 2018 - 15:23:50 EST

Next message: David Miller: "Re: [PATCH] net: qca_spi: Introduce write register verification"
Previous message: Adam Ford: "Re: [PATCH] regulator: pfuze100-regulator: add coin support"
In reply to: Alexey Budankov: "Re: [RFCv2 00/48] perf tools: Add threads to record command"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

On 21.09.2018 15:15, Alexey Budankov wrote:
> Hello Jiri,
>
> On 21.09.2018 9:13, Alexey Budankov wrote:
>> Hello Jiri,
>>
>> On 14.09.2018 12:37, Alexey Budankov wrote:
>>> On 14.09.2018 11:28, Jiri Olsa wrote:
>>>> On Fri, Sep 14, 2018 at 10:26:53AM +0200, Jiri Olsa wrote:
>>>>
>>>> SNIP
>>>>
>>>>>>> The threaded monitoring currently can't monitor backward maps
>>>>>>> and there are probably more limitations which I haven't spotted
>>>>>>> yet.
>>>>>>>
>>>>>>> So far I tested on laptop:
>>>>>>> http://people.redhat.com/~jolsa/record_threads/test-4CPU.txt
>>>>>>>
>>>>>>> and a one bigger server:
>>>>>>> http://people.redhat.com/~jolsa/record_threads/test-208CPU.txt
>>>>>>>
>>>>>>> I can see decrease in recorded LOST events, but both the benchmark
>>>>>>> and the monitoring must be carefully configured wrt:
>>>>>>> - number of events (frequency)
>>>>>>> - size of the memory maps
>>>>>>> - size of events (callchains)
>>>>>>> - final perf.data size
>>>>>>>
>>>>>>> It's also available in:
>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>>>>>>> perf/record_threads
>>>>>>>
>>>>>>> thoughts? ;-) thanks
>>>>>>> jirka
>>>>>>
>>>>>> It is preferable to split into smaller pieces that bring
>>>>>> some improvement proved by metrics numbers and ready for
>>>>>> merging and upstream. Do we have more metrics than the
>>>>>> data loss from trace AIO patches?
>>>>>
>>>>> well the primary focus is to get more events in,
>>>>> so the LOST metric is the main one
>>>>
>>>> actualy I was hoping, could you please run it through the same
>>>> tests as you do for AIO code on some huge server?
>>>
>>> Yeah, I will, but it takes some time.
>>
>> Here it is:
>>
>> Hardware:
>> cat /proc/cpuinfo
>> processor : 271
>> vendor_id : GenuineIntel
>> cpu family : 6
>> model : 133
>> model name : Intel(R) Xeon Phi(TM) CPU 7285 @ 1.30GHz
>> stepping : 0
>> microcode : 0xe
>> cpu MHz : 1064.235
>> cache size : 1024 KB
>> physical id : 0
>> siblings : 272
>> core id : 73
>> cpu cores : 68
>> apicid : 295
>> initial apicid : 295
>> fpu : yes
>> fpu_exception : yes
>> cpuid level : 13
>> wp : yes
>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts avx512_vpopcntdq avx512_4vnniw avx512_4fmaps
>> bugs : cpu_meltdown spectre_v1 spectre_v2
>> bogomips : 2594.07
>> clflush size : 64
>> cache_alignment : 64
>> address sizes : 46 bits physical, 48 bits virtual
>> power management:
>>
>> uname -a
>> Linux nntpat98-196 4.18.0-rc7+ #2 SMP Thu Sep 6 13:24:37 MSK 2018 x86_64 x86_64 x86_64 GNU/Linux
>>
>> cat /proc/sys/kernel/perf_event_paranoid
>> 0
>>
>> cat /proc/sys/kernel/perf_event_mlock_kb
>> 516
>>
>> cat /proc/sys/kernel/perf_event_max_sample_rate
>> 3000
>>
>> cat /etc/redhat-release
>> Red Hat Enterprise Linux Server release 7.5 (Maipo)
>>
>> Metrics:
>> runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
>> data loss (%) : paused_time / elapsed_time_under_profiling
>> LOST events : stat from perf report --stats
>> SAMPLE events : stat from perf report --stats
>> perf.data size (B) : size of trace file on disk
>>
>> Events:
>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>
>> =================================================
>>
>> Command:
>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>> -e cpu/period=P,event=0x3c/Duk,\
>> cpu/period=P,umask=0x3/Duk,\
>> cpu/period=P,event=0xc0/Duk,\
>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>
>> Workload: matrix multiplication in 256 threads
>>
>> /usr/bin/time ./matrix.icc
>> Addr of buf1 = 0x7ff9faa73010
>> Offs of buf1 = 0x7ff9faa73180
>> Addr of buf2 = 0x7ff9f8a72010
>> Offs of buf2 = 0x7ff9f8a721c0
>> Addr of buf3 = 0x7ff9f6a71010
>> Offs of buf3 = 0x7ff9f6a71100
>> Addr of buf4 = 0x7ff9f4a70010
>> Offs of buf4 = 0x7ff9f4a70140
>> Threads #: 256 Pthreads
>> Matrix size: 2048
>> Using multiply kernel: multiply1
>> Freq = 0.997720 GHz
>> Execution time = 9.061 seconds
>> 1639.55user 6.59system 0:07.12elapsed 23094%CPU (0avgtext+0avgdata 100448maxresident)k
>> 96inputs+0outputs (1major+33839minor)pagefaults 0swaps
>>
>> T : 272
>> P (period, ms) : 0.1
>> runtime overhead (%) : 45x ~ 323.54 / 7.12
>> data loss (%) : 96
>> LOST events : 323662
>> SAMPLE events : 31885479
>> perf.data size (GiB) : 42
>>
>> P (period, ms) : 0.25
>> runtime overhead (%) : 25x ~ 180.76 / 7.12
>> data loss (%) : 69
>> LOST events : 10636
>> SAMPLE events : 18692998
>> perf.data size (GiB) : 23.5
>>
>> P (period, ms) : 0.35
>> runtime overhead (%) : 16x ~ 119.49 / 7.12
>> data loss (%) : 1
>> LOST events : 6
>> SAMPLE events : 11178524
>> perf.data size (GiB) : 14
>>
>> T : 128
>> P (period, ms) : 0.35
>> runtime overhead (%) : 15x ~ 111.98 / 7.12
>> data loss (%) : 62
>> LOST events : 2825
>> SAMPLE events : 11267247
>> perf.data size (GiB) : 15
>>
>> T : 64
>> P (period, ms) : 0.35
>> runtime overhead (%) : 14x ~ 101.55 / 7.12
>> data loss (%) : 67
>> LOST events : 5155
>> SAMPLE events : 10966297
>> perf.data size (GiB) : 13.7
>>
>> Workload: matrix multiplication in 128 threads
>>
>> /usr/bin/time ./matrix.gcc
>> Addr of buf1 = 0x7f072e630010
>> Offs of buf1 = 0x7f072e630180
>> Addr of buf2 = 0x7f072c62f010
>> Offs of buf2 = 0x7f072c62f1c0
>> Addr of buf3 = 0x7f072a62e010
>> Offs of buf3 = 0x7f072a62e100
>> Addr of buf4 = 0x7f072862d010
>> Offs of buf4 = 0x7f072862d140
>> Threads #: 128 Pthreads
>> Matrix size: 2048
>> Using multiply kernel: multiply1
>> Execution time = 6.639 seconds
>> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata 100756maxresident)k
>> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
>>
>> T : 272
>> P (period, ms) : 0.1
>> runtime overhead (%) : 29x ~ 198.81 / 6.81
>> data loss (%) : 21
>> LOST events : 2502
>> SAMPLE events : 22481062
>> perf.data size (GiB) : 27.6
>>
>> P (period, ms) : 0.25
>> runtime overhead (%) : 13x ~ 88.47 / 6.81
>> data loss (%) : 0
>> LOST events : 0
>> SAMPLE events : 9572787
>> perf.data size (GiB) : 11.3
>>
>> P (period, ms) : 0.35
>> runtime overhead (%) : 10x ~ 67.11 / 6.81
>> data loss (%) : 1
>> LOST events : 137
>> SAMPLE events : 6985930
>> perf.data size (GiB) : 8
>>
>> T : 128
>> P (period, ms) : 0.35
>> runtime overhead (%) : 9.5x ~ 64.33 / 6.81
>> data loss (%) : 1
>> LOST events : 3
>> SAMPLE events : 6666903
>> perf.data size (GiB) : 7.8
>>
>> T : 64
>> P (period, ms) : 0.25
>> runtime overhead (%) : 17x ~ 114.27 / 6.81
>> data loss (%) : 2
>> LOST events : 52
>> SAMPLE events : 12643645
>> perf.data size (GiB) : 15.5
>>
>> P (period, ms) : 0.35
>> runtime overhead (%) : 10x ~ 68.60 / 6.81
>> data loss (%) : 1
>> LOST events : 93
>> SAMPLE events : 7164368
>> perf.data size (GiB) : 8.5
>
> and this is for AIO and serial:
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.aio record --aio=N \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> N : 512
> P (period, ms) : 2.5
> runtime overhead (%) : 2.7x ~ 19.21 / 7.12
> data loss (%) : 42
> LOST events : 1600
> SAMPLE events : 1235928
> perf.data size (GiB) : 1.5
>
> N : 272
> P (period, ms) : 1.5
> runtime overhead (%) : 2.5x ~ 18.09 / 7.12
> data loss (%) : 89
> LOST events : 3457
> SAMPLE events : 1222143
> perf.data size (GiB) : 1.5
>
> P (period, ms) : 2
> runtime overhead (%) : 2.5x ~ 17.93 / 7.12
> data loss (%) : 65
> LOST events : 2496
> SAMPLE events : 1240754
> perf.data size (GiB) : 1.5
>
> P (period, ms) : 2.5
> runtime overhead (%) : 2.5x ~ 17.87 / 7.12
> data loss (%) : 44
> LOST events : 1621
> SAMPLE events : 1221949
> perf.data size (GiB) : 1.5
>
> P (period, ms) : 3
> runtime overhead (%) : 2.5x ~ 18.43 / 7.12
> data loss (%) : 12
> LOST events : 350
> SAMPLE events : 1117972
> perf.data size (GiB) : 1.3
>
> N : 128
> P (period, ms) : 3
> runtime overhead (%) : 2.4x ~ 17.08 / 7.12
> data loss (%) : 11
> LOST events : 335
> SAMPLE events : 1116832
> perf.data size (GiB) : 1.3
>
> N : 64
> P (period, ms) : 3
> runtime overhead (%) : 2.2x ~ 16.03 / 7.12
> data loss (%) : 11
> LOST events : 329
> SAMPLE events : 1108205
> perf.data size (GiB) : 1.3
>
> Workload: matrix multiplication in 128 threads
>
> N : 512
> P (period, ms) : 1
> runtime overhead (%) : 3.5x ~ 23.72 / 6.81
> data loss (%) : 18
> LOST events : 1043
> SAMPLE events : 2015306
> perf.data size (GiB) : 2.3
>
> N : 272
> P (period, ms) : 0.5
> runtime overhead (%) : 3x ~ 22.72 / 6.81
> data loss (%) : 90
> LOST events : 5842
> SAMPLE events : 2205937
> perf.data size (GiB) : 2.5
>
> P (period, ms) : 1
> runtime overhead (%) : 3x ~ 22.79 / 6.81
> data loss (%) : 11
> LOST events : 481
> SAMPLE events : 2017099
> perf.data size (GiB) : 2.5
>
> P (period, ms) : 1.5
> runtime overhead (%) : 3x ~ 19.93 / 6.81
> data loss (%) : 5
> LOST events : 190
> SAMPLE events : 1308692
> perf.data size (GiB) : 1.5
>
> P (period, ms) : 2
> runtime overhead (%) : 3x ~ 18.95 / 6.81
> data loss (%) : 0
> LOST events : 0
> SAMPLE events : 1010769
> perf.data size (GiB) : 1.2
>
> N : 128
> P (period, ms) : 1.5
> runtime overhead (%) : 3x ~ 19.08 / 6.81
> data loss (%) : 6
> LOST events : 220
> SAMPLE events : 1322240
> perf.data size (GiB) : 1.5
>
> N : 64
> P (period, ms) : 1.5
> runtime overhead (%) : 3x ~ 19.43 / 6.81
> data loss (%) : 3
> LOST events : 130
> SAMPLE events : 1386521
> perf.data size (GiB) : 1.6
>
> =================================================
>
> Command:
> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf record \
> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>
> Workload: matrix multiplication in 256 threads
>
> P (period, ms) : 7.5
> runtime overhead (%) : 1.6x ~ 11.6 / 7.12
> data loss (%) : 1
> LOST events : 1
> SAMPLE events : 451062
> perf.data size (GiB) : 0.5
>
> Workload: matrix multiplication in 128 threads
>
> P (period, ms) : 3
> runtime overhead (%) : 1.8x ~ 12.58 / 6.81
> data loss (%) : 9
> LOST events : 147
> SAMPLE events : 673299
> perf.data size (GiB) : 0.8

Please see more comparable data by P (period, ms),
runtime overhead and data loss metrics at the same time.

It start from serial implementation as the baseline and
then demonstrates possible improvement applying configurable
--aio(=N) and --threads(=T) implementations.

Smaller P values, with data loss and runtime overhead values
equal or in small vicinity of the ones from serial implementation,
might mean possible gain.

Workload: matrix multiplication in 128 threads

Serial:
P (period, ms) : 3
runtime overhead (%) : 1.8x ~ 12.58 / 6.81
data loss (%) : 9
LOST events : 147
SAMPLE events : 673299
perf.data size (GiB) : 0.8

AIO:
N : 1
P (period, ms) : 3
runtime overhead (%) : 1.8x ~ 12.42 / 6.81
data loss (%) : 2
LOST events : 19
SAMPLE events : 664749
perf.data size (GiB) : 0.75

N : 4
P (period, ms) : 1.8
runtime overhead (%) : 1.8x ~ 12.74 / 6.81
data loss (%) : 10
LOST events : 257
SAMPLE events : 1079250
perf.data size (GiB) : 1.25

Threads:
T : 1
P (period, ms) : 3
runtime overhead (%) : 2.6x ~ 17.73 / 6.81
data loss (%) : 6
LOST events : 95
SAMPLE events : 665844
perf.data size (GiB) : 0.78

T : 2
P (period, ms) : 3
runtime overhead (%) : 2.6x ~ 18.04 / 6.81
data loss (%) : 0
LOST events : 0
SAMPLE events : 662075
perf.data size (GiB) : 0.8

P (period, ms) : 1.8
runtime overhead (%) : 3x ~ 20.83 / 6.81
data loss (%) : 4
LOST events : 76
SAMPLE events : 1085826
perf.data size (GiB) : 1.25

T : 4
P (period, ms) : 3
runtime overhead (%) : 2.6x ~ 17.85 / 6.81
data loss (%) : 0
LOST events : 0
SAMPLE events : 665262
perf.data size (GiB) : 0.78

P (period, ms) : 1.8
runtime overhead (%) : 3x ~ 21.15 / 6.81
data loss (%) : 0
LOST events : 0
SAMPLE events : 1126563
perf.data size (GiB) : 1.3

P (period, ms) : 1
runtime overhead (%) : 4.35x ~ 29.6 / 6.81
data loss (%) : 0
LOST events : 6
SAMPLE events : 2124837
perf.data size (GiB) : 2.5

P (period, ms) : 0.8
runtime overhead (%) : 4.8x ~ 32.62 / 6.81
data loss (%) : 12
LOST events : 536
SAMPLE events : 2620345
perf.data size (GiB) : 3

Thanks,
Alexey

>
> Thanks,
> Alexey
>
>>
>> Thanks,
>> Alexey
>>
>>>
>>>>
>>>> thanks,
>>>> jirka
>>>>
>>>
>>
>

Next message: David Miller: "Re: [PATCH] net: qca_spi: Introduce write register verification"
Previous message: Adam Ford: "Re: [PATCH] regulator: pfuze100-regulator: add coin support"
In reply to: Alexey Budankov: "Re: [RFCv2 00/48] perf tools: Add threads to record command"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]