Re: [RFC PATCH] x86/CPU/AMD: Bring back Compute Unit ID

From: Ingo Molnar
Date: Thu Feb 02 2017 - 11:29:20 EST



* Ghannam, Yazen <Yazen.Ghannam@xxxxxxx> wrote:

> Here are my results on a 32C Bulldozer system with an SSD. Also, I use ccache so
> I added "ccache -C" in the pre-build script so the cache gets cleared.
>
> Before:
> Performance counter stats for 'make -s -j65 bzImage' (3 runs):
>
>     2375752.777479  task-clock (msec)  #   23.589 CPUs utilized    ( +- 0.35% )
>          1,198,979  context-switches   #    0.505 K/sec            ( +- 0.34% )
>      8,964,671,259  cache-misses                                   ( +- 0.44% )
>             79,399  cpu-migrations     #    0.033 K/sec            ( +- 1.92% )
>         37,840,875  page-faults        #    0.016 M/sec            ( +- 0.20% )
>  5,425,612,846,538  cycles             #    2.284 GHz              ( +- 0.36% )
>  3,367,750,745,825  instructions       #     0.62 insn per cycle   ( +- 0.11% )
>    750,591,286,261  branches           #  315.938 M/sec            ( +- 0.11% )
>     43,544,059,077  branch-misses      #    5.80% of all branches  ( +- 0.08% )
>
> 100.716043494 seconds time elapsed ( +- 1.97% )
>
> After:
> Performance counter stats for 'make -s -j65 bzImage' (3 runs):
>
>     1736720.488346  task-clock (msec)  #   23.529 CPUs utilized    ( +- 0.16% )
>          1,144,737  context-switches   #    0.659 K/sec            ( +- 0.20% )
>      8,570,352,975  cache-misses                                   ( +- 0.33% )
>             91,817  cpu-migrations     #    0.053 K/sec            ( +- 1.67% )
>         37,688,118  page-faults        #    0.022 M/sec            ( +- 0.03% )
>  5,547,082,899,245  cycles             #    3.194 GHz              ( +- 0.19% )
>  3,363,365,420,405  instructions       #     0.61 insn per cycle   ( +- 0.00% )
>    749,676,420,820  branches           #  431.662 M/sec            ( +- 0.00% )
>     43,243,046,270  branch-misses      #    5.77% of all branches  ( +- 0.01% )
>
> 73.810517234 seconds time elapsed ( +- 0.02% )

That's pretty impressive: ~35% difference in wall clock performance of this
workload.

And that while both the cycle and the instruction counts are within 2.5% of each
other. The only stat that differs beyond the level of noise is cache-misses:

8,964,671,259 cache-misses ( +- 0.44% )
8,570,352,975 cache-misses ( +- 0.33% )

which is 4.5%, but I have trouble believing that just 4.5% more cache misses can
have such a massive effect on performance.
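
For reference, the relative differences can be recomputed directly from the
quoted counters. This is only a minimal Python sketch: the dictionary keys and
the pct_change() helper are names made up here for illustration, and only the
numbers themselves come from the perf output above.

```python
# Counter values copied from the two quoted 'perf stat' runs.
# The key names are illustrative, not perf identifiers.
before = {
    "elapsed_s": 100.716043494,
    "cycles": 5_425_612_846_538,
    "instructions": 3_367_750_745_825,
    "cache_misses": 8_964_671_259,
}
after = {
    "elapsed_s": 73.810517234,
    "cycles": 5_547_082_899_245,
    "instructions": 3_363_365_420_405,
    "cache_misses": 8_570_352_975,
}


def pct_change(old, new):
    """Relative change from old to new, in percent of the old value."""
    return (new - old) / old * 100.0


# Change of every counter going from the 'before' run to the 'after' run.
changes = {name: pct_change(before[name], after[name]) for name in before}

for name, pct in changes.items():
    print(f"{name:>14}: {pct:+.2f}%")
```

Note that the percentage depends on which run is taken as the baseline: the
elapsed time drops by roughly 27% relative to 'before', which is the same gap
as 'before' taking roughly 36% longer than 'after'.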

So unless +4.5% cache misses can cause a 35% difference in performance this is a
really weird result. Where did the extra performance come from - was the 'good'
workload perhaps running at higher CPU frequencies for some reason?
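
The quoted output carries a hint on the frequency question: the GHz figure
perf prints next to 'cycles' is the cycle count divided by task-clock. A small
Python sketch recomputing it from the quoted numbers follows; the function and
variable names are illustrative assumptions, and the result is only an average
over the whole run, so it says nothing about why the clock rate differed.

```python
# Values copied from the quoted 'perf stat' output.
# task-clock is reported in milliseconds, cycles is a raw count.
BEFORE_TASK_CLOCK_MS = 2375752.777479
BEFORE_CYCLES = 5_425_612_846_538
AFTER_TASK_CLOCK_MS = 1736720.488346
AFTER_CYCLES = 5_547_082_899_245


def effective_ghz(cycles, task_clock_ms):
    """Average clock rate in GHz: cycles per second of CPU time, over 1e9."""
    task_clock_s = task_clock_ms / 1e3
    return cycles / task_clock_s / 1e9


ghz_before = effective_ghz(BEFORE_CYCLES, BEFORE_TASK_CLOCK_MS)
ghz_after = effective_ghz(AFTER_CYCLES, AFTER_TASK_CLOCK_MS)
ratio = ghz_after / ghz_before

print(f"before: {ghz_before:.3f} GHz")
print(f"after:  {ghz_after:.3f} GHz")
print(f"ratio:  {ratio:.3f}")
```

This reproduces the 2.284 GHz and 3.194 GHz that perf already printed, a ratio
of about 1.4, which is in the same range as the wall-clock difference.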

Thanks,

Ingo