Re: [PATCH 1/2] x86: separating entry text section

From: Ingo Molnar
Date: Tue Feb 22 2011 - 03:09:57 EST



* Jiri Olsa <jolsa@xxxxxxxxxx> wrote:

> Putting x86 entry code to the separate section: .entry.text.

Trying to apply your patch I noticed one detail:

> before patch:
> 26282174 L1-icache-load-misses ( +- 0.099% ) (scaled from 81.00%)
> 0.206651959 seconds time elapsed ( +- 0.152% )
>
> after patch:
> 24237651 L1-icache-load-misses ( +- 0.117% ) (scaled from 80.96%)
> 0.210509948 seconds time elapsed ( +- 0.140% )

So time elapsed actually went up.

hackbench is notoriously unstable when it comes to runtime - and increasing the
--repeat value only has limited effects on that.

Dropping the page cache:

echo 1 > /proc/sys/vm/drop_caches

Seems to do a better job of 'resetting' system state, but if we put that into the
measured workload then the results are all over the place (as we now depend on IO
being done):

# cat hb10

echo 1 > /proc/sys/vm/drop_caches
./hackbench 10

# perf stat --repeat 3 ./hb10

Time: 0.097
Time: 0.095
Time: 0.101

Performance counter stats for './hb10' (3 runs):

      21.351257  task-clock-msecs  #     0.044 CPUs   ( +- 27.165% )
              6  context-switches  #     0.000 M/sec  ( +- 34.694% )
              1  CPU-migrations    #     0.000 M/sec  ( +- 25.000% )
            410  page-faults       #     0.019 M/sec  ( +-  0.081% )
     25,407,650  cycles            #  1189.984 M/sec  ( +- 49.154% )
     25,407,650  instructions      #     1.000 IPC    ( +- 49.154% )
      5,126,580  branches          #   240.107 M/sec  ( +- 46.012% )
        192,272  branch-misses     #     3.750 %      ( +- 44.911% )
        901,701  cache-references  #    42.232 M/sec  ( +- 12.857% )
        802,767  cache-misses      #    37.598 M/sec  ( +-  9.282% )

    0.483297792  seconds time elapsed   ( +- 31.152% )

So here's a perf stat feature suggestion to solve such measurement problems: a new
'pre-run' 'dry' command could be specified that is executed before the real 'hot'
run is executed. Something like this:

perf stat --pre-run-script ./hb10 --repeat 10 ./hackbench 10

This would do the cache clearing before each run: it would run hackbench once (the
dry run), then run 'hackbench 10' for real, and repeat the whole thing 10 times.
Only the 'hot' portion of each run would be measured and reflected in the event
counts that perf stat displays.
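
Until perf stat grows such an option (--pre-run-script is only a proposal in this
mail, not an existing flag), the idea can be approximated with a small wrapper. The
following is a minimal sketch in Python, not a perf feature: the function names are
made up for illustration, it measures wall-clock time only (no PMU events), and the
spread it reports is the sample standard deviation relative to the mean, which may
not match how perf stat computes its '+-' column.

```python
import statistics
import subprocess
import time


def measure_hot_runs(pre_cmd, hot_cmd, repeat):
    """Run pre_cmd (untimed), then hot_cmd (timed), `repeat` times.

    Returns the wall-clock durations of hot_cmd in seconds.
    """
    durations = []
    for _ in range(repeat):
        # Preparation step (cache dropping, dry run, ...): deliberately not measured.
        subprocess.run(pre_cmd, check=True)
        start = time.perf_counter()
        subprocess.run(hot_cmd, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, spread), where spread is stddev as a percentage of the mean."""
    mean = statistics.mean(durations)
    if len(durations) < 2 or mean == 0:
        return mean, 0.0
    return mean, statistics.stdev(durations) / mean * 100.0
```

With the files from this mail, something like
measure_hot_runs(["sh", "./hb10"], ["./hackbench", "10"], repeat=10) would do the
cache drop and dry run untimed before each measured run (hb10 is passed to sh
because the script shown above has no interpreter line, and writing to drop_caches
needs root).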

Another observation:

> 24237651 L1-icache-load-misses ( +- 0.117% ) (scaled from 80.96%)

Could you please do runs that do not display 'scaled from' messages? Since we are
measuring a relatively small effect here, and scaling adds noise, it would be nice
to ensure that the effect persists with non-scaled events as well:

You can do that by reducing the number of events that are measured. The PMU cannot
measure all those L1 cache events you listed at once - so only use the most
important one and add cycles and instructions to make sure the measurements are
comparable:

-e L1-icache-load-misses -e instructions -e cycles
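
For background on why scaled counts are noisier: when more events are requested
than the PMU has counters for, perf time-multiplexes them and extrapolates each
count from the fraction of time the event was actually scheduled. A rough sketch of
that arithmetic, assuming the usual count * time_enabled / time_running estimate
(the function names are illustrative, not perf internals):

```python
def scale_count(raw_count, time_enabled, time_running):
    """Extrapolate a multiplexed count to the full time the event was enabled."""
    if time_running == 0:
        raise ValueError("event was never scheduled on the PMU")
    return raw_count * time_enabled / time_running


def raw_from_scaled(scaled_count, running_percent):
    """Approximate the count actually observed, given the 'scaled from' percentage."""
    return scaled_count * running_percent / 100.0


# From the quoted line: 24237651 L1-icache-load-misses (scaled from 80.96%)
observed = raw_from_scaled(24237651, 80.96)
print(f"roughly {observed:,.0f} misses were actually counted")
```

In other words, about 19.6 million of the reported misses were counted directly and
the remaining 4.6 million or so are an estimate, which is a meaningful share when
the effect being measured is only a few percent.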

Btw., there's another 'perf stat' feature suggestion: it would be nice if it were
possible to 'record' a perf stat run, and do a 'perf diff' over it. That would
compare the two runs automatically, without you having to do the comparison
manually.
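
In the meantime the comparison can be scripted over saved perf stat text output.
The sketch below is a stand-in for the suggested feature, not an existing perf
command: it assumes each line of interest starts with a number followed by an event
name, as in the output quoted in this mail, and the helper names are hypothetical.

```python
import re

# Matches lines such as:
#   26282174 L1-icache-load-misses ( +- 0.099% ) (scaled from 81.00%)
#   25,407,650 cycles # 1189.984 M/sec ( +- 49.154% )
LINE_RE = re.compile(r"^\s*([\d,]+(?:\.\d+)?)\s+([A-Za-z][\w-]*)")


def parse_counts(text):
    """Map name -> value for every line that starts with a number and a name."""
    counts = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            value, name = match.groups()
            counts[name] = float(value.replace(",", ""))
    return counts


def diff_counts(before_text, after_text):
    """Return {name: percent change} for entries present in both outputs."""
    before = parse_counts(before_text)
    after = parse_counts(after_text)
    return {
        name: (after[name] - before[name]) / before[name] * 100.0
        for name in before
        if name in after and before[name] != 0
    }


# The two result blocks quoted at the top of this mail.
BEFORE = """\
26282174 L1-icache-load-misses ( +- 0.099% ) (scaled from 81.00%)
0.206651959 seconds time elapsed ( +- 0.152% )
"""
AFTER = """\
24237651 L1-icache-load-misses ( +- 0.117% ) (scaled from 80.96%)
0.210509948 seconds time elapsed ( +- 0.140% )
"""

for name, change in diff_counts(BEFORE, AFTER).items():
    print(f"{name:24s} {change:+.2f}%")
```

The elapsed-time line is keyed as "seconds" here because the parser only takes the
first word after the number; a real implementation would want to work from
structured data rather than scraping text.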

Thanks,

Ingo