Re: [RFCv2 00/48] perf tools: Add threads to record command
From: Alexey Budankov
Date: Mon Sep 24 2018 - 14:34:29 EST
Hi,
On 24.09.2018 17:29, Jiri Olsa wrote:
> On Mon, Sep 24, 2018 at 04:09:09PM +0300, Alexey Budankov wrote:
>> Hi,
>>
>> On 24.09.2018 10:02, Alexey Budankov wrote:
>>> Hi,
>>>
>>> On 23.09.2018 22:30, Jiri Olsa wrote:
>>>> On Fri, Sep 21, 2018 at 09:13:08AM +0300, Alexey Budankov wrote:
>>>>
>>>> SNIP
>>>>
>>>>> Events:
>>>>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>>>>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>>>>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>>>>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>>>>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>>>>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>>>>
>>>>> =================================================
>>>>>
>>>>> Command:
>>>>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>>>>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>>>>> -e cpu/period=P,event=0x3c/Duk,\
>>>>> cpu/period=P,umask=0x3/Duk,\
>>>>> cpu/period=P,event=0xc0/Duk,\
>>>>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>>>>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>>>>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>>>>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>>>
>>>> hum, so I guess the results suck because of the -a option,
>>>> getting extra samples for all the perf record threads
>>>>
>>>> could you try without the -a? you monitor only user events,
>>>> so you're interested only in ./matrix.* samples, right?
>>>
>>> Ok, trying without -a, in per-process mode.
>>
>> Command:
>>
>> /usr/bin/time ./perf.thr record --threads=T \
>> -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>> -e cpu/period=P,event=0x3c/Duk,\
>> cpu/period=P,umask=0x3/Duk,\
>> cpu/period=P,event=0xc0/Duk,\
>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
>> cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
>> cpu/period=0x11171,event=0xc2,umask=0x40/uk \
>> --clockid=monotonic_raw -- ./matrix.gcc
>>
>> Workload: matrix multiplication in 128 threads
>>
>> T : 272
>> P (period, ms) : 0.35
>> runtime overhead (%) : 13x ~ 87.73 / 6.81
>
> how do you meassure this?
This is the ratio of elapsed times:
runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
i.e.
/usr/bin/time ./matrix.gcc
...
767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata 100756maxresident)k
88inputs+0outputs (0major+139898minor)pagefaults 0swaps
so elapsed_time = 6.81 sec
elapsed_time_uder_profiling is elapsed value from output of
/usr/bin/time ./perf.thr record --threads=T ...
>
>> data loss (%) : 0
>> LOST events : 36
>> SAMPLE events : 8048542
>> perf.data size (GiB) : 10
>
> any idea why does it have some much more samples?
Presumably, this is because period is 350us and this is the smallest
one that perf.thr manages to capture data without data loss (=0) when T=272.
However, during collection, I get message that max sampling frequency
is lowered to 3KHz.
Thanks,
Alexey
>
> thanks,
> jirka
>