Re: [RFCv2 00/48] perf tools: Add threads to record command
From: Alexey Budankov
Date: Mon Sep 24 2018 - 09:09:16 EST
Hi,
On 24.09.2018 10:02, Alexey Budankov wrote:
> Hi,
>
> On 23.09.2018 22:30, Jiri Olsa wrote:
>> On Fri, Sep 21, 2018 at 09:13:08AM +0300, Alexey Budankov wrote:
>>
>> SNIP
>>
>>> Events:
>>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>>
>>> =================================================
>>>
>>> Command:
>>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>>> -e cpu/period=P,event=0x3c/Duk,\
>>> cpu/period=P,umask=0x3/Duk,\
>>> cpu/period=P,event=0xc0/Duk,\
>>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>
>> hum, so I guess the results suck because of the -a option,
>> getting extra samples for all the perf record threads
>>
>> could you try without the -a? you monitor only user events,
>> so you're interested only in ./matrix.* samples, right?
>
> Ok, trying without -a, in per-process mode.
Command:
/usr/bin/time ./perf.thr record --threads=T \
-N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
-e cpu/period=P,event=0x3c/Duk,\
cpu/period=P,umask=0x3/Duk,\
cpu/period=P,event=0xc0/Duk,\
cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
cpu/period=0x11171,event=0xc2,umask=0x40/uk \
--clockid=monotonic_raw -- ./matrix.gcc
Workload: matrix multiplication in 128 threads

T : 272
P (period, ms) : 0.35
runtime overhead (ratio) : 13x ~ 87.73 / 6.81
data loss (%) : 0
LOST events : 36
SAMPLE events : 8048542
perf.data size (GiB) : 10

T : 128
P (period, ms) : 0.35
runtime overhead (ratio) : 10x ~ 71.12 / 6.81
data loss (%) : 0
LOST events : 2
SAMPLE events : 6524363
perf.data size (GiB) : 8

T : 64
P (period, ms) : 0.35
runtime overhead (ratio) : 10x ~ 71.89 / 6.81
data loss (%) : 0
LOST events : 2
SAMPLE events : 7160623
perf.data size (GiB) : 9
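
For a rough feel of where the perf.data volume comes from, the size can be divided by the SAMPLE count. This is only a sketch: it assumes the sizes are GiB rounded to whole numbers as listed, it ignores non-SAMPLE records in the file, and the names (RUNS, bytes_per_sample) are made up for illustration.

```python
# Sketch: estimate average bytes written per SAMPLE event for the
# --threads runs listed above. The sizes are rounded to whole GiB in the
# listing, so the result is an approximation, and it slightly overstates
# the per-sample cost because perf.data also holds non-SAMPLE records.

GIB = 2 ** 30

# T -> (perf.data size in GiB, SAMPLE events), copied from the listing above
RUNS = {
    272: (10, 8048542),
    128: (8, 6524363),
    64: (9, 7160623),
}


def bytes_per_sample(size_gib, samples):
    """Average number of perf.data bytes per SAMPLE event."""
    return size_gib * GIB / samples


if __name__ == "__main__":
    for threads, (size_gib, samples) in RUNS.items():
        print(f"T={threads}: {bytes_per_sample(size_gib, samples):.0f} bytes/sample")
```

All three runs land a little above 1300 bytes per sample, which is consistent with --call-graph dwarf,1024 copying 1024 bytes of user stack into every sample, plus the registers and sample header.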
=================================================
Command:
/usr/bin/time ./perf.aio record --aio=N \
-N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
-e cpu/period=P,event=0x3c/Duk,\
cpu/period=P,umask=0x3/Duk,\
cpu/period=P,event=0xc0/Duk,\
cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
cpu/period=0x11171,event=0xc2,umask=0x40/uk \
--clockid=monotonic_raw ./matrix.gcc
Workload: matrix multiplication in 128 threads

N : 512
P (period, ms) : 1.5
runtime overhead (ratio) : 2.8x ~ 19.20 / 6.81
data loss (%) : 0
LOST events : 0
SAMPLE events : 1094976
perf.data size (GiB) : 1.3

N : 272
P (period, ms) : 1.5
runtime overhead (ratio) : 3.3x ~ 22.34 / 6.81
data loss (%) : 0
LOST events : 0
SAMPLE events : 1089252
perf.data size (GiB) : 1.3

N : 128
P (period, ms) : 1.5
runtime overhead (ratio) : 2.6x ~ 15.15 / 6.81
data loss (%) : 1
LOST events : 1
SAMPLE events : 1094102
perf.data size (GiB) : 1.3

N : 64
P (period, ms) : 1.5
runtime overhead (ratio) : 2.4x ~ 16.23 / 6.81
data loss (%) : 2
LOST events : 18
SAMPLE events : 1105986
perf.data size (GiB) : 1.3
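
The overhead rows in both listings can be recomputed with a few lines. This is a minimal sketch, assuming each row is the elapsed time with perf record divided by the 6.81 baseline without it (same units for both, as reported by /usr/bin/time); ELAPSED, overhead_ratio and overhead_table are illustrative names only.

```python
# Sketch: recompute the "runtime overhead" multiples from the listings.
# Assumption: overhead = <elapsed with perf record> / <elapsed without>,
# with 6.81 taken as the unprofiled baseline in the same units.

BASELINE = 6.81

# (mode, T or N value) -> elapsed time with profiling, copied from the listings
ELAPSED = {
    ("threads", 272): 87.73,
    ("threads", 128): 71.12,
    ("threads", 64): 71.89,
    ("aio", 512): 19.20,
    ("aio", 272): 22.34,
    ("aio", 128): 15.15,
    ("aio", 64): 16.23,
}


def overhead_ratio(elapsed, baseline=BASELINE):
    """How many times slower the profiled run is than the baseline."""
    return elapsed / baseline


def overhead_table(elapsed_by_run=ELAPSED, baseline=BASELINE):
    """Map each run to its overhead multiple, rounded to two decimals."""
    return {
        run: round(overhead_ratio(elapsed, baseline), 2)
        for run, elapsed in elapsed_by_run.items()
    }


if __name__ == "__main__":
    for (mode, value), ratio in overhead_table().items():
        print(f"{mode:8s} {value:4d}  {ratio:.2f}x")
```

Computed this way the --threads runs sit at roughly 10x to 13x and the --aio runs at roughly 2x to 3x; for N=128 the quotient 15.15 / 6.81 works out to about 2.2x.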
Thanks,
Alexey
> VTune collects both user and kernel mode samples, using the /uk modifiers.
> The set can be extended to collect in VM host and guests as well.
>
> Thanks,
> Alexey
>
>>
>> thanks,
>> jirka
>>
>