Re: [RFCv2 00/48] perf tools: Add threads to record command

From: Alexey Budankov
Date: Mon Sep 24 2018 - 15:12:23 EST


Hi,

On 24.09.2018 21:32, Alexey Budankov wrote:
> Hi,
>
> On 24.09.2018 17:29, Jiri Olsa wrote:
>> On Mon, Sep 24, 2018 at 04:09:09PM +0300, Alexey Budankov wrote:
>>> Hi,
>>>
>>> On 24.09.2018 10:02, Alexey Budankov wrote:
>>>> Hi,
>>>>
>>>> On 23.09.2018 22:30, Jiri Olsa wrote:
>>>>> On Fri, Sep 21, 2018 at 09:13:08AM +0300, Alexey Budankov wrote:
>>>>>
>>>>> SNIP
>>>>>
>>>>>> Events:
>>>>>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
>>>>>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
>>>>>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
>>>>>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
>>>>>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
>>>>>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
>>>>>>
>>>>>> =================================================
>>>>>>
>>>>>> Command:
>>>>>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
>>>>>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>>>>>> -e cpu/period=P,event=0x3c/Duk,\
>>>>>> cpu/period=P,umask=0x3/Duk,\
>>>>>> cpu/period=P,event=0xc0/Duk,\
>>>>>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
>>>>>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
>>>>>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
>>>>>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
>>>>>
>>>>> hum, so I guess the results suck because of the -a option,
>>>>> getting extra samples for all the perf record threads
>>>>>
>>>>> could you try without the -a? you monitor only user events,
>>>>> so you're interested only in ./matrix.* samples, right?
>>>>
>>>> Ok, trying without -a, in per-process mode.
>>>
>>> Command:
>>>
>>> /usr/bin/time ./perf.thr record --threads=T \
>>> -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
>>> -e cpu/period=P,event=0x3c/Duk,\
>>> cpu/period=P,umask=0x3/Duk,\
>>> cpu/period=P,event=0xc0/Duk,\
>>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
>>> cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
>>> cpu/period=0x11171,event=0xc2,umask=0x40/uk \
>>> --clockid=monotonic_raw -- ./matrix.gcc
>>>
>>> Workload: matrix multiplication in 128 threads
>>>
>>> T : 272
>>> P (period, ms) : 0.35
>>> runtime overhead (%) : 13x ~ 87.73 / 6.81
>>
>> how do you meassure this?
>
> This is the ratio of elapsed times:
> runtime overhead (%) : elapsed_time_under_profiling / elapsed_time
> i.e.
>
> /usr/bin/time ./matrix.gcc
> ...
> 767.03user 11.17system 0:06.81elapsed 11424%CPU (0avgtext+0avgdata 100756maxresident)k
> 88inputs+0outputs (0major+139898minor)pagefaults 0swaps
>
> so elapsed_time = 6.81 sec
>
> elapsed_time_uder_profiling is elapsed value from output of
>
> /usr/bin/time ./perf.thr record --threads=T ...
>
>>
>>> data loss (%) : 0
>>> LOST events : 36
>>> SAMPLE events : 8048542
>>> perf.data size (GiB) : 10
>>
>> any idea why does it have some much more samples?
>
> Presumably, this is because period is 350us and this is the smallest
> one that perf.thr manages to capture data without data loss (=0) when T=272.
> However, during collection, I get message that max sampling frequency
> is lowered to 3KHz.

Lowering default frequency rate to 3000.
Please consider tweaking /proc/sys/kernel/perf_event_max_sample_rate.

Thanks,
Alexey

>
> Thanks,
> Alexey
>
>>
>> thanks,
>> jirka
>>
>