On Mon, Sep 28, 2020 at 05:22:54PM +1000, Gavin Shan wrote:
Testing
=======
[1] The experiment reveals how heavily L1 data cache misses impact
overall application performance. The machine on which the test
is carried out has the L1 data cache topology listed below, and
the host kernel has the configuration listed below.
The test case allocates contiguous page frames through HugeTLBfs
and reads 4 bytes of data from the same offset (0x0) in each of
these N contiguous page frames. N is 8 and 9, respectively, in
the following two test cases. This is repeated one million
times.
Note that 8 is the number of L1 data cache ways. The experiment is
meant to cause L1 data cache thrashing on one particular set.
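To make the geometry concrete: 64 sets of 64-byte lines give a 4KB way,
so offset 0x0 of every 4KB-aligned page frame maps to set 0. The C
sketch below is not the test program used here. It is a toy model of a
single cache set, and it assumes strict LRU replacement and a set index
taken from the address bits directly above the line offset, neither of
which is stated in this mail. The names set_index() and
lru_set_misses() are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_WAYS 16	/* arbitrary upper bound for this sketch */

/* Set index of an address: (addr / line_size) % num_sets. */
static unsigned set_index(uint64_t addr, unsigned line_size, unsigned num_sets)
{
	return (unsigned)((addr / line_size) % num_sets);
}

/*
 * Count misses when n_lines distinct cache lines that all map to the
 * same set are read round-robin, 'rounds' times over.
 * ASSUMPTION: strict LRU replacement; real hardware may differ.
 */
static uint64_t lru_set_misses(unsigned ways, unsigned n_lines, uint64_t rounds)
{
	unsigned tags[MAX_WAYS];	/* tags[0] is the most recently used */
	unsigned used = 0;
	uint64_t misses = 0;

	assert(ways >= 1 && ways <= MAX_WAYS);

	for (uint64_t r = 0; r < rounds; r++) {
		for (unsigned line = 0; line < n_lines; line++) {
			unsigned pos = 0;

			/* Look the line up in the set. */
			while (pos < used && tags[pos] != line)
				pos++;

			if (pos == used) {	/* miss */
				misses++;
				if (used < ways)
					used++;
				/* Fill a free way, or evict the LRU way. */
				pos = used - 1;
			}

			/* Move the line to the most-recently-used slot. */
			for (; pos > 0; pos--)
				tags[pos] = tags[pos - 1];
			tags[0] = line;
		}
	}
	return misses;
}
```

Under these assumptions, lru_set_misses(8, 8, 1000000) returns 8 (cold
misses only), while lru_set_misses(8, 9, 1000000) returns 9000000, one
miss per read. That is the same order of magnitude as the measured
43,429 versus 9,038,460.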
Host:       CONFIG_ARM64_PAGE_SHIFT=12
            DEFAULT_HUGE_PAGE_SIZE=2MB
L1 dcache:  cache-line-size=64
            number-of-sets=64
            number-of-ways=8

                           N=8            N=9
------------------------------------------------------------------
cache-misses:              43,429         9,038,460
L1-dcache-load-misses:     43,429         9,038,460
seconds time elapsed:      0.299206372    0.722253140 (2.41 times)
[2] The experiment should have been carried out on a machine where the
capacity of one L1 data cache way is larger than 4KB. However, I'm
unable to find such a machine, so I have to evaluate the
performance impact caused by L2 data cache thrashing instead. The
experiment is carried out on a machine with the following L1/L2
data cache topology. The host kernel configuration is the same as
in [1].
The corresponding test program allocates contiguous page frames
through HugeTLBfs and builds VMAs backed by zero pages. The
contiguous pages are read sequentially from a fixed offset (0) in
steps of 32KB, 8 times. After that, the VMAs backed by zero pages are
read sequentially in steps of 4KB, once. This is repeated 8 million
times.
Note that 32KB is the capacity of one L2 data cache way and 8 is the
number of L2 data cache ways. This experiment is meant to cause L2
data cache thrashing on one particular set.
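The 32KB figure follows from the L2 geometry: 512 sets of 64-byte lines
per way. A small C sketch of that arithmetic, assuming the set index is
(addr / line_size) % num_sets, which this mail does not state; struct
cache_geom and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

struct cache_geom {
	unsigned line_size;	/* bytes per cache line */
	unsigned num_sets;
	unsigned num_ways;
};

/* Bytes covered by one way: line_size * num_sets. */
static uint64_t way_size(const struct cache_geom *g)
{
	return (uint64_t)g->line_size * g->num_sets;
}

/* ASSUMPTION: index bits sit directly above the line offset. */
static unsigned set_of(const struct cache_geom *g, uint64_t addr)
{
	assert(g->line_size != 0 && g->num_sets != 0);
	return (unsigned)((addr / g->line_size) % g->num_sets);
}

/* Return 1 if 'count' reads at base, base + stride, ... share one set. */
static int same_set(const struct cache_geom *g, uint64_t base,
		    uint64_t stride, unsigned count)
{
	unsigned first = set_of(g, base);

	for (unsigned i = 1; i < count; i++)
		if (set_of(g, base + i * stride) != first)
			return 0;
	return 1;
}
```

With { .line_size = 64, .num_sets = 512, .num_ways = 8 }, way_size()
is 32768, so 8 reads at a 32KB stride land in one set and fill all 8
ways, whereas reads at a 4KB stride are spread over different sets.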
L1 dcache:  <same as [1]>
L2 dcache:  cache-line-size=64
            number-of-sets=512
            number-of-ways=8

-----------------------------------------------------------------------
cache-references:          1,427,213,737   1,421,394,472
cache-misses:              35,804,552      42,636,698
L1-dcache-load-misses:     35,804,552      42,636,698
seconds time elapsed:      2.602511671     2.098198172 (+19.3%)

No one is denying that there is a performance improvement in a very
specific case, but what's missing here is an explanation of how these
artificial benchmarks relate to real-world applications.