Perf and Hackbench results on my machine

From: Hyeonggon Yoo
Date: Mon Oct 11 2021 - 06:33:10 EST


Hello Vlastimil.

On Mon, Oct 11, 2021 at 09:21:01AM +0200, Vlastimil Babka wrote:
> On 10/11/21 00:49, David Rientjes wrote:
> > On Fri, 8 Oct 2021, Hyeonggon Yoo wrote:
> >
> >> It's certain that an object will be not only read, but also
> >> written after allocation.
> >>
> >
> > Why is it certain? I think perhaps what you meant to say is that if we
> > are doing any prefetching here, then access will benefit from prefetchw
> > instead of prefetch. But it's not "certain" that allocated memory will be
> > accessed at all.
>
> I think the primary reason there's a prefetch is freelist traversal. The
> cacheline we prefetch will be read during the next allocation, so if we
> expect there to be one soon, prefetch might help.

I agree with that.

> That the freepointer is
> part of object itself and thus the cache line will be probably accessed also
> after the allocation, is secondary.

Right, it depends on the cache line size and on whether the first cache
line of an object is frequently accessed.
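
To make the freelist-traversal point concrete, here is a minimal userspace
sketch. It is not the SLUB code: struct freelist, freelist_init() and
freelist_alloc() are hypothetical names made up for illustration, and the
free pointer is assumed to live in the first word of each free object.
__builtin_prefetch() is the GCC/Clang builtin; its second argument selects
a read (0) or write (1) prefetch, which is roughly the prefetch vs
prefetchw distinction discussed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical freelist: every free object stores the address of the
 * next free object in its first word. */
struct freelist {
	void *head;
};

/* Carve buf into count objects of obj_size bytes and link them in order.
 * buf must be suitably aligned for storing a pointer. */
static void freelist_init(struct freelist *fl, void *buf,
			  size_t obj_size, size_t count)
{
	char *p = buf;

	assert(count > 0);
	assert(obj_size >= sizeof(void *));
	assert(obj_size % sizeof(void *) == 0);

	for (size_t i = 0; i < count; i++) {
		void *next = (i + 1 < count) ? p + (i + 1) * obj_size : NULL;

		*(void **)(p + i * obj_size) = next;
	}
	fl->head = buf;
}

/* Pop the head object. The next allocation will read the free pointer
 * stored in the new head, so hint the CPU to fetch that cache line now. */
static void *freelist_alloc(struct freelist *fl)
{
	void *obj = fl->head;
	void *next;

	if (!obj)
		return NULL;

	next = *(void **)obj;
	fl->head = next;
	if (next)
		__builtin_prefetch(next, 1); /* 1 = write intent, 0 = read only */

	return obj;
}
```

The prefetch is only a hint and does not change what freelist_alloc()
returns, so switching the second argument between 0 and 1 affects cache
behaviour only.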

> Yeah this might help some workloads, but
> perhaps hurt others - these things might look obvious in theory but be
> rather unpredictable in practice. At least some hackbench results would help...
>

Below are my measurements. It seems prefetch(w) is not making things
worse, at least on hackbench.

Measured on 16 CPUs (ARM64) / 16G RAM
Without prefetch:

Time: 91.989
Performance counter stats for 'hackbench -g 100 -l 10000':
1467926.03 msec cpu-clock # 15.907 CPUs utilized
17782076 context-switches # 12.114 K/sec
957523 cpu-migrations # 652.296 /sec
104561 page-faults # 71.230 /sec
1622117569931 cycles # 1.105 GHz (54.54%)
2002981132267 instructions # 1.23 insn per cycle (54.32%)
5600876429 branch-misses (54.28%)
642657442307 cache-references # 437.800 M/sec (54.27%)
19404890844 cache-misses # 3.019 % of all cache refs (54.28%)
640413686039 L1-dcache-loads # 436.271 M/sec (46.85%)
19110650580 L1-dcache-load-misses # 2.98% of all L1-dcache accesses (46.83%)
651556334841 dTLB-loads # 443.862 M/sec (46.63%)
3193647402 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.84%)
538927659684 iTLB-loads # 367.135 M/sec (54.31%)
118503839 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
625750168840 L1-icache-loads # 426.282 M/sec (46.80%)
24348083282 L1-icache-load-misses # 3.89% of all L1-icache accesses (46.78%)

92.284351157 seconds time elapsed

44.524693000 seconds user
1426.214006000 seconds sys

With prefetch:

Time: 91.677

Performance counter stats for 'hackbench -g 100 -l 10000':
1462938.07 msec cpu-clock # 15.908 CPUs utilized
18072550 context-switches # 12.354 K/sec
1018814 cpu-migrations # 696.416 /sec
104558 page-faults # 71.471 /sec
2003670016013 instructions # 1.27 insn per cycle (54.31%)
5702204863 branch-misses (54.28%)
643368500985 cache-references # 439.778 M/sec (54.26%)
18475582235 cache-misses # 2.872 % of all cache refs (54.28%)
642206796636 L1-dcache-loads # 438.984 M/sec (46.87%)
18215813147 L1-dcache-load-misses # 2.84% of all L1-dcache accesses (46.83%)
653842996501 dTLB-loads # 446.938 M/sec (46.63%)
3227179675 dTLB-load-misses # 0.49% of all dTLB cache accesses (46.85%)
537531951350 iTLB-loads # 367.433 M/sec (54.33%)
114750630 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.37%)
630135543177 L1-icache-loads # 430.733 M/sec (46.80%)
22923237620 L1-icache-load-misses # 3.64% of all L1-icache accesses (46.76%)

91.964452802 seconds time elapsed

43.416742000 seconds user
1422.441123000 seconds sys

With prefetchw:

Time: 90.220

Performance counter stats for 'hackbench -g 100 -l 10000':
1437418.48 msec cpu-clock # 15.880 CPUs utilized
17694068 context-switches # 12.310 K/sec
958257 cpu-migrations # 666.651 /sec
100604 page-faults # 69.989 /sec
1583259429428 cycles # 1.101 GHz (54.57%)
2004002484935 instructions # 1.27 insn per cycle (54.37%)
5594202389 branch-misses (54.36%)
643113574524 cache-references # 447.409 M/sec (54.39%)
18233791870 cache-misses # 2.835 % of all cache refs (54.37%)
640205852062 L1-dcache-loads # 445.386 M/sec (46.75%)
17968160377 L1-dcache-load-misses # 2.81% of all L1-dcache accesses (46.79%)
651747432274 dTLB-loads # 453.415 M/sec (46.59%)
3127124271 dTLB-load-misses # 0.48% of all dTLB cache accesses (46.75%)
535395273064 iTLB-loads # 372.470 M/sec (54.38%)
113500056 iTLB-load-misses # 0.02% of all iTLB cache accesses (54.35%)
628871845924 L1-icache-loads # 437.501 M/sec (46.80%)
22585641203 L1-icache-load-misses # 3.59% of all L1-icache accesses (46.79%)

90.514819303 seconds time elapsed

43.877656000 seconds user
1397.176001000 seconds sys
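
For easier comparison, the differences can be expressed relative to the
run without prefetch. The helper below is only a sketch: pct_change() is
a hypothetical name, and the constants are copied from the perf output
above. Each configuration was run once, so the results carry no variance
estimate.

```c
#include <assert.h>

/* Elapsed time in seconds, copied from the perf output above. */
static const double elapsed_none      = 92.284351157;
static const double elapsed_prefetch  = 91.964452802;
static const double elapsed_prefetchw = 90.514819303;

/* cache-misses counts, copied from the perf output above. */
static const double misses_none      = 19404890844.0;
static const double misses_prefetch  = 18475582235.0;
static const double misses_prefetchw = 18233791870.0;

/* Percentage change of value relative to baseline.
 * A negative result means value is lower than baseline. */
static double pct_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

With these inputs, pct_change(elapsed_none, elapsed_prefetch) is about
-0.35 and pct_change(elapsed_none, elapsed_prefetchw) is about -1.9. For
cache-misses the changes are about -4.8 (prefetch) and -6.0 (prefetchw).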

Thanks,
Hyeonggon