Re: [RFCv2 00/48] perf tools: Add threads to record command
From: Jiri Olsa
Date: Mon Sep 24 2018 - 10:29:32 EST
On Mon, Sep 24, 2018 at 04:09:09PM +0300, Alexey Budankov wrote:
> Hi,
>
> On 24.09.2018 10:02, Alexey Budankov wrote:
> > Hi,
> >
> > On 23.09.2018 22:30, Jiri Olsa wrote:
> >> On Fri, Sep 21, 2018 at 09:13:08AM +0300, Alexey Budankov wrote:
> >>
> >> SNIP
> >>
> >>> Events:
> >>> cpu/period=P,event=0x3c/Duk;CPU_CLK_UNHALTED.THREAD
> >>> cpu/period=P,umask=0x3/Duk;CPU_CLK_UNHALTED.REF_TSC
> >>> cpu/period=P,event=0xc0/Duk;INST_RETIRED.ANY
> >>> cpu/period=0xaae61,event=0xc2,umask=0x10/uk;UOPS_RETIRED.ALL
> >>> cpu/period=0x11171,event=0xc2,umask=0x20/uk;UOPS_RETIRED.SCALAR_SIMD
> >>> cpu/period=0x11171,event=0xc2,umask=0x40/uk;UOPS_RETIRED.PACKED_SIMD
> >>>
> >>> =================================================
> >>>
> >>> Command:
> >>> /usr/bin/time /tmp/vtune_amplifier_2019.574715/bin64/perf.thr record --threads=T \
> >>> -a -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> >>> -e cpu/period=P,event=0x3c/Duk,\
> >>> cpu/period=P,umask=0x3/Duk,\
> >>> cpu/period=P,event=0xc0/Duk,\
> >>> cpu/period=0x30d40,event=0xc2,umask=0x10/uk,\
> >>> cpu/period=0x4e20,event=0xc2,umask=0x20/uk,\
> >>> cpu/period=0x4e20,event=0xc2,umask=0x40/uk \
> >>> --clockid=monotonic_raw -- ./matrix.(icc|gcc)
> >>
> >> hum, so I guess the results suck because of the -a option,
> >> getting extra samples for all the perf record threads
> >>
> >> could you try without the -a? you monitor only user events,
> >> so you're interested only in ./matrix.* samples, right?
> >
> > Ok, trying without -a, in per-process mode.
>
> Command:
>
> /usr/bin/time ./perf.thr record --threads=T \
> -N -B -T -R --call-graph dwarf,1024 --user-regs=ip,bp,sp \
> -e cpu/period=P,event=0x3c/Duk,\
> cpu/period=P,umask=0x3/Duk,\
> cpu/period=P,event=0xc0/Duk,\
> cpu/period=0xaae61,event=0xc2,umask=0x10/uk,\
> cpu/period=0x11171,event=0xc2,umask=0x20/uk,\
> cpu/period=0x11171,event=0xc2,umask=0x40/uk \
> --clockid=monotonic_raw -- ./matrix.gcc
>
> Workload: matrix multiplication in 128 threads
>
> T : 272
> P (period, ms) : 0.35
> runtime overhead (%) : 13x ~ 87.73 / 6.81

how do you measure this?

> data loss (%) : 0
> LOST events : 36
> SAMPLE events : 8048542
> perf.data size (GiB) : 10

any idea why it has so many more samples?

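for reference, the arithmetic behind the quoted numbers can be checked with a small sketch. It assumes that 87.73 and 6.81 are elapsed times in seconds with and without perf record; the thread does not say how they were obtained, which is the open question above. It also decodes the hexadecimal sample periods from the two command lines. The variable names are made up for illustration.

```python
# Assumption: elapsed seconds with and without profiling; the thread
# does not state how these two values were measured.
with_perf_s = 87.73
baseline_s = 6.81
slowdown = with_perf_s / baseline_s
print(f"slowdown: {slowdown:.2f}x")

# Fixed sample periods for the three UOPS_RETIRED events, copied from
# the two quoted command lines and decoded from hex to decimal.
periods = {
    "system-wide (-a) run": ["0x30d40", "0x4e20", "0x4e20"],
    "per-process run": ["0xaae61", "0x11171", "0x11171"],
}
decoded = {run: [int(p, 16) for p in hexes] for run, hexes in periods.items()}
for run, values in decoded.items():
    print(run, values)
```

the per-process command uses fixed periods of 700001 and 70001 for the UOPS_RETIRED events, while the earlier -a command used 200000 and 20000.
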
thanks,
jirka