Re: [RFC PATCH] x86/64: Optimize the effective instruction cache footprint of kernel functions

From: Ingo Molnar
Date: Thu May 21 2015 - 07:36:30 EST



* Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:

> I was thinking about Ingo's AMD results:
>
> linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed
> linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed
> linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed
>
> AMD is almost perfect. Having no alignment at all still works very
> well. [...]

Not quite. As I mentioned in my post, the 'time elapsed' numbers
were very noisy in the AMD case - and you've cut off the stddev column
that shows this. Here is the full data:

linux-falign-functions=_64-bytes/res-amd.txt: 1.928409143 seconds time elapsed ( +- 2.74% )
linux-falign-functions=__8-bytes/res-amd.txt: 1.940703051 seconds time elapsed ( +- 1.84% )
linux-falign-functions=__1-bytes/res-amd.txt: 1.940744001 seconds time elapsed ( +- 2.15% )

2-3% of stddev for a ~0.6% difference in elapsed time is not conclusive.

What you should use instead are the cachemiss counts, which are a good
proxy and a lot more stable statistically:

linux-falign-functions=_64-bytes/res-amd.txt: 108,886,550 L1-icache-load-misses ( +- 0.10% ) (100.00%)
linux-falign-functions=__8-bytes/res-amd.txt: 123,810,566 L1-icache-load-misses ( +- 0.18% ) (100.00%)
linux-falign-functions=__1-bytes/res-amd.txt: 113,623,200 L1-icache-load-misses ( +- 0.17% ) (100.00%)

which shows that 64 bytes alignment still generates a better I$ layout
than tight packing, resulting in about 4.2% fewer I$ misses.
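
To make the 'conclusive vs. in the noise' distinction concrete, here is a
quick back-of-the-envelope check using the numbers above (just an
illustrative sketch, using a crude 2x-stddev rule of thumb rather than a
proper significance test):

/*
 * Quick back-of-the-envelope check of the AMD numbers above: compare
 * the relative difference between the 64-byte and 1-byte builds with
 * the reported stddev, for both elapsed time and I$ miss counts.
 * The "2x stddev" threshold is only a crude rule of thumb.
 */
#include <stdio.h>

static void check(const char *what, double v64, double v1, double stddev_pct)
{
	double diff_pct = (v1 - v64) / v1 * 100.0;

	printf("%-18s: %.2f%% difference vs. +- %.2f%% stddev => %s\n",
	       what, diff_pct, stddev_pct,
	       diff_pct > 2.0 * stddev_pct ? "conclusive" : "in the noise");
}

int main(void)
{
	/* elapsed time in seconds; stddev is the worse of the two runs: */
	check("elapsed time", 1.928409143, 1.940744001, 2.74);

	/* L1-icache-load-misses; again the worse of the two stddevs: */
	check("L1 icache misses", 108886550.0, 113623200.0, 0.17);

	return 0;
}

The ~0.6% elapsed-time difference is well within the +- 2.74% noise, while
the ~4.2% miss-count difference is more than 20 times its +- 0.17% stddev.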

On Intel it's more pronounced:

linux-falign-functions=_64-bytes/res.txt: 647,853,942 L1-icache-load-misses ( +- 0.07% ) (100.00%)
linux-falign-functions=__1-bytes/res.txt: 724,539,055 L1-icache-load-misses ( +- 0.31% ) (100.00%)

That is a 12% difference. Note that the Intel workload was running off
SSDs, which makes the cache footprint several times larger, and the
workload is also more realistic than the AMD test, which ran in tmpfs.

I think it's a fair bet that the AMD system would show a similar
difference if it ran the same workload.

Allowing smaller functions to be cut in half by cacheline boundaries
looks like a losing strategy, especially with larger workloads.

The modified scheme I suggested - 64 bytes alignment plus intelligent
packing - might do even better than dumb 64 bytes alignment.
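
To spell out what I mean by 'intelligent packing', the placement rule
would be roughly the following (pseudo-C sketch only - the helper name is
made up, and the real decision would have to live in the compiler/linker,
not in C code):

/*
 * Rough sketch of the placement rule (illustrative pseudo-C only, the
 * helper name is made up): keep 64-byte alignment as the default, but
 * if a function still fits into the tail of the current cache line,
 * pack it there instead of padding the rest of the line.
 */
#include <stdio.h>
#include <stdint.h>

#define CACHELINE	64

static uint64_t place_function(uint64_t offset, uint64_t fn_size)
{
	uint64_t room = CACHELINE - (offset % CACHELINE);

	/* Fits into the remainder of the current cache line: pack it. */
	if (fn_size <= room)
		return offset;

	/* Otherwise start it on the next 64-byte boundary. */
	return (offset + CACHELINE - 1) & ~(uint64_t)(CACHELINE - 1);
}

int main(void)
{
	/*
	 * A 24-byte function at offset 100 gets packed at 100 (it fits
	 * in the 28 bytes left of that line); a 40-byte one gets pushed
	 * to the next boundary at 128.
	 */
	printf("%llu\n", (unsigned long long)place_function(100, 24));
	printf("%llu\n", (unsigned long long)place_function(100, 40));
	return 0;
}

I.e. a function only starts mid-cacheline if it fits entirely into the
tail of that line; everything else starts on a 64-byte boundary, so
nothing gets needlessly cut in half by a cacheline boundary.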

Thanks,

Ingo