Re: performance counter 20% error finding retired instruction count
From: Ingo Molnar
Date: Wed Jun 24 2009 - 11:10:34 EST
* Vince Weaver <vince@xxxxxxxxxx> wrote:
> Hello
>
> As an aside, is it time to set up a dedicated Performance Counters
> for Linux mailing list? (Hereafter referred to as p10c7l to avoid
> confusion with the other implementations that have already taken
> all the good abbreviated forms of the concept).
('perfcounters' is the name of the subsystem/feature and it's
unique.)
> [...] If/when the infrastructure appears in a released kernel,
> there's going to be a lot of chatter by people who use performance
> counters and suddenly find they are stuck with a huge step
> backwards in functionality. And asking Fortran programmers to
> provide kernel patches probably won't be a productive response.
> But I digress.
>
> I was trying to get an exact retired instruction count from
> p10c7l. I am using the test million.s, available here
>
> ( http://www.csl.cornell.edu/~vince/projects/perf_counter/million.s )
>
> It should count exactly one million instructions.
>
> Tests with valgrind and qemu show that it does.
>
> Using perfmon2 on Pentium Pro, PII, PIII, P4, Athlon32, and Phenom
> all give the proper result:
>
> tobler:~% pfmon -e retired_instructions ./million
> 1000002 RETIRED_INSTRUCTIONS
>
> ( it is 1,000,002 +/-2 because on most x86 architectures retired
> instruction count includes any hardware interrupts that might
> happen at the time. It would be a great feature if p10c7l
> could add some way of gathering the per-process hardware
> instruction count statistic to help quantify that).
>
> Yet perf on the same Athlon32 machine (using
> kernel 2.6.30-03984-g45e3e19) gives:
>
> tobler:~% perf stat ./million
>
> Performance counter stats for './million':
>
> 1.519366 task-clock-ticks # 0.835 CPU utilization factor
> 3 context-switches # 0.002 M/sec
> 0 CPU-migrations # 0.000 M/sec
> 53 page-faults # 0.035 M/sec
> 2483822 cycles # 1634.775 M/sec
> 1240849 instructions # 816.689 M/sec # 0.500 per cycle
> 612685 cache-references # 403.250 M/sec
> 3564 cache-misses # 2.346 M/sec
>
> Wall-clock time elapsed: 1.819226 msecs
>
> Running multiple times gives:
> 1240849
> 1257312
> 1242313
>
> That's a varying error of at least 20%, and it isn't even
> consistent between runs. Is this because of sampling? As far
> as I can tell, the documentation doesn't really warn about this.
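The "at least 20%" figure can be checked directly from the three runs quoted above, against the 1,000,002 retired instructions the pfmon runs established (a quick sketch, using only numbers from this thread):

```python
# Relative error of each observed perf count vs. the expected
# retired-instruction count of million.s (1,000,002, per pfmon).
expected = 1_000_002
observed = [1_240_849, 1_257_312, 1_242_313]

for count in observed:
    err = 100.0 * (count - expected) / expected
    print(f"{count}: {err:.1f}% over")
```

All three runs come out between roughly 24% and 26% over, so the error is indeed above 20% and varies run to run.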
>
> Thanks for any help resolving this problem
Thanks for the question! There are still gaps in the documentation,
so let me explain the basics here:
'perf stat' counts the true cost of executing the command in
question, including the costs of:
fork()ing the task
exec()-ing it
the ELF loader resolving dynamic symbols
the app hitting various pagefaults that instantiate its pagetables
etc.
Those operations are pretty 'noisy' on a typical CPU, with lots of
cache effects, so the noise you see is real.
You can eliminate much of the noise by only counting user-space
instructions, as much of the command startup cost is in
kernel-space.
Running your test-app that way can be done as follows:
$ perf stat --repeat 10 -e 0:1:u ./million
Performance counter stats for './million' (10 runs):
1002106 instructions ( +- 0.015% )
0.000599029 seconds time elapsed.
( note the --repeat feature of perf stat - it does a loop of command
executions and observes the noise and displays it. )
Those ~2100 extra instructions are executed by your app too: they
come from the ELF dynamic loader starting up your test-app.
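The overhead works out to a tiny fraction of the workload (simple arithmetic on the counts above):

```python
measured = 1_002_106   # perf stat -e 0:1:u result above
workload = 1_000_002   # known retired count of million.s

overhead = measured - workload
print(f"{overhead} startup instructions "
      f"({100.0 * overhead / measured:.2f}% of the total)")
```

That is about 2,100 instructions, or roughly 0.2% of even this tiny workload.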
If some tool reports less than that, then that tool is not being
truthful about the true overhead of your application.
Also note that applications that only execute 1 million instructions
are very, very rare - a modern CPU can execute billions of
instructions per second, per core.
So i usually test with a more realistic reference app, one that
executes 1 billion instructions:
$ perf stat --repeat 10 -e 0:1:u ./loop_1b_instructions
Performance counter stats for './loop_1b_instructions' (10 runs):
1000079797 instructions ( +- 0.000% )
0.239947420 seconds time elapsed.
The noise there is very low (despite ~240 milliseconds still being a
very short runtime).
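Plugging the numbers from that run back in confirms the "billions of instructions per second, per core" claim (a quick back-of-the-envelope check):

```python
# Throughput implied by the 1-billion-instruction run above.
instructions = 1_000_079_797
elapsed_s = 0.239947420

rate = instructions / elapsed_s
print(f"{rate / 1e9:.2f} billion instructions/second")
```

That single core is retiring on the order of 4 billion instructions per second.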
Hope this helps - thanks,
Ingo