Re: performance counter ~0.4% error finding retired instruction count

From: Ingo Molnar
Date: Sat Jun 27 2009 - 02:05:01 EST



* Vince Weaver <vince@xxxxxxxxxx> wrote:

> On Fri, 26 Jun 2009, Vince Weaver wrote:
>
>> From the best I can tell digging through the perf sources, the
>> performance counters are set up and started in userspace, but instead
>> of doing an immediate clone/exec, thousands of instructions worth of
>> other stuff is done by perf in between.
>
> and for the curious, wondering how a simple
>
> prctl(COUNTERS_ENABLE);
> fork()
> execvp()
>
> can cause 6000+ instructions of non-deterministic execution, it
> turns out that perf is dynamically linked. So it has to spend
> 5000+ cycles in ld-linux.so resolving the execvp() symbol before
> it can actually execvp.

I measured 2000, but generally a few thousand cycles per invocation
sounds about right.

That is in the 0.0001% measurement overhead range (per 'perf stat'
invocation) for any realistic app that does something worth
measuring - and even with a worst-case 'cheapest app' case it is in
the 0.2-0.4% range.

Besides, you compare perfcounters to perfmon (to which you seem to
be a contributor), while in reality perfmon has much, much worse
(and unfixable, because designed-in) measurement overhead.

So why are you criticising perfcounters for a 5000 cycles
measurement overhead while perfmon has huge, _hundreds of millions_
of cycles measurement overhead (per second) for various realistic
workloads? [ In fact in one of the scheduler-tests perfmon has a
whopping measurement overhead of _nine billion_ cycles, it increased
total runtime of the workload from 3.3 seconds to 6.6 seconds. (!) ]

Why are you using a double standard here?

Here are some numbers to put the 5000 cycles startup cost into
perspective. For example the default startup costs of even the
simplest Linux binaries (/bin/true):

titan:~> perf stat /bin/true

Performance counter stats for '/bin/true':

0.811328 task-clock-msecs # 1.002 CPUs
1 context-switches # 0.001 M/sec
1 CPU-migrations # 0.001 M/sec
180 page-faults # 0.222 M/sec
1267713 cycles # 1562.516 M/sec
733772 instructions # 0.579 IPC
26261 cache-references # 32.368 M/sec
531 cache-misses # 0.654 M/sec

0.000809407 seconds time elapsed

5000/1267713 cycles is in the 0.4% range. Run any app that actually
does something beyond starting up, an app which has a chance to get
a decent cache footprint and gets into steady state so that it gets
stable properties that can be measured reliably - and you'll get
into the billions of cycles range or more - at which point a few
thousand cycles is in the 0.0001% measurement overhead range.

Compare to this the intrinsic noise of cycles metrics for some
benchmark like hackbench:

titan:~> perf stat -r 2 -e 0:0 -- ~/hackbench 10
Time: 0.448
Time: 0.447

Performance counter stats for '/home/mingo/hackbench 10' (2 runs):

2661715310 cycles ( +- 0.588% )

0.480153304 seconds time elapsed ( +- 0.549% )

The noise in this (very short) hackbench run above was 15 _million_
cycles. See how small a few thousand cycles are?

If the 5 thousand cycles measurement overhead _still_ matters to you
under such circumstances then by all means please submit the patches
to improve it. Despite your claims, this is totally fixable within
the current perfcounters design; Peter outlined the steps for
solving it, and you can utilize ptrace if you want to.

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/