Re: [PATCH v2 0/9] x86/clear_huge_page: multi-page clearing

From: Raghavendra K T
Date: Tue Sep 05 2023 - 12:33:55 EST


On 8/31/2023 12:19 AM, Ankur Arora wrote:
> This series adds a multi-page clearing primitive, clear_pages(),
> which enables more effective use of x86 string instructions by
> advertising the real region-size to be cleared.
>
> Region-size can be used as a hint by uarchs to optimize the
> clearing.
>
> Also add allow_resched() which marks a code-section as allowing
> rescheduling in the irqentry_exit path. This allows clear_pages()
> to get by without having to call cond_resched() periodically.
> (preempt_model_full() already handles this via
> irqentry_exit_cond_resched(), so we handle this similarly for
> preempt_model_none() and preempt_model_voluntary().)



Hello Ankur,
Thanks for the patches.

I tried the patches; the improvements look similar to V1 (even without
the circuitous chunk optimizations).
We still see a similar 50-60% improvement for 2M and 1G page sizes.


SUT: Bergamo
CPU family: 25
Model: 160
Thread(s) per core: 2
Core(s) per socket: 128
Socket(s): 2

NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-127,256-383
NUMA node1 CPU(s): 128-255,384-511

Test: use mmap(MAP_HUGETLB) to demand-fault a 64GB region (on NUMA node0), for both base-hugepage-size=2M and 1GB.
The current result is with thp=always, but madvise also did not make much difference.

perf stat -r 10 -d -d numactl -m 0 -N 0 <test>

time in seconds elapsed (average of 10 runs) (lower = better)

Result:
base: mm/clear_huge_page
patched: x86/clear_huge_page

page-size    base       patched     improvement %
2M           5.0779     2.50623     50.64
1G           2.51890    1.012439    59.81

More details:

Performance counter stats for 'mm/map_hugetlb' (10 runs):

5,058.71 msec task-clock # 0.996 CPUs utilized ( +- 0.26% )
8 context-switches # 1.576 /sec ( +- 7.23% )
0 cpu-migrations # 0.000 /sec
32,917 page-faults # 6.484 K/sec ( +- 0.00% )
15,797,804,067 cycles # 3.112 GHz ( +- 0.26% ) (35.70%)
2,073,754 stalled-cycles-frontend # 0.01% frontend cycles idle ( +- 1.25% ) (35.71%)
27,508,977 stalled-cycles-backend # 0.17% backend cycles idle ( +- 9.48% ) (35.74%)
1,143,710,651 instructions # 0.07 insn per cycle
# 0.03 stalled cycles per insn ( +- 0.15% ) (35.76%)
243,817,330 branches # 48.028 M/sec ( +- 0.12% ) (35.78%)
357,760 branch-misses # 0.15% of all branches ( +- 1.52% ) (35.75%)
2,540,733,497 L1-dcache-loads # 500.483 M/sec ( +- 0.04% ) (35.74%)
1,093,660,557 L1-dcache-load-misses # 42.98% of all L1-dcache accesses ( +- 0.03% ) (35.71%)
73,335,478 L1-icache-loads # 14.446 M/sec ( +- 0.08% ) (35.70%)
878,378 L1-icache-load-misses # 1.19% of all L1-icache accesses ( +- 2.65% ) (35.68%)
1,025,714 dTLB-loads # 202.049 K/sec ( +- 2.70% ) (35.69%)
405,407 dTLB-load-misses # 37.35% of all dTLB cache accesses ( +- 1.59% ) (35.68%)
2 iTLB-loads # 0.394 /sec ( +- 41.63% ) (35.68%)
40,356 iTLB-load-misses # 1552153.85% of all iTLB cache accesses ( +- 7.18% ) (35.68%)

5.0779 +- 0.0132 seconds time elapsed ( +- 0.26% )

Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb' (10 runs):

2,538.40 msec task-clock # 1.013 CPUs utilized ( +- 0.27% )
4 context-switches # 1.597 /sec ( +- 6.51% )
1 cpu-migrations # 0.399 /sec
32,916 page-faults # 13.140 K/sec ( +- 0.00% )
7,901,830,782 cycles # 3.154 GHz ( +- 0.27% ) (35.67%)
6,590,473 stalled-cycles-frontend # 0.08% frontend cycles idle ( +- 10.31% ) (35.71%)
329,970,288 stalled-cycles-backend # 4.23% backend cycles idle ( +- 13.65% ) (35.74%)
725,811,962 instructions # 0.09 insn per cycle
# 0.80 stalled cycles per insn ( +- 0.37% ) (35.78%)
132,182,704 branches # 52.767 M/sec ( +- 0.26% ) (35.82%)
254,163 branch-misses # 0.19% of all branches ( +- 2.47% ) (35.81%)
2,382,927,453 L1-dcache-loads # 951.262 M/sec ( +- 0.04% ) (35.77%)
1,082,022,067 L1-dcache-load-misses # 45.41% of all L1-dcache accesses ( +- 0.02% ) (35.74%)
47,164,491 L1-icache-loads # 18.828 M/sec ( +- 0.37% ) (35.70%)
474,535 L1-icache-load-misses # 0.99% of all L1-icache accesses ( +- 2.93% ) (35.66%)
1,477,334 dTLB-loads # 589.750 K/sec ( +- 5.12% ) (35.65%)
624,125 dTLB-load-misses # 56.24% of all dTLB cache accesses ( +- 5.66% ) (35.65%)
0 iTLB-loads # 0.000 /sec (35.65%)
1,626 iTLB-load-misses # 7069.57% of all iTLB cache accesses ( +-283.51% ) (35.65%)

2.50623 +- 0.00691 seconds time elapsed ( +- 0.28% )


Performance counter stats for 'numactl -m 0 -N 0 mm/map_hugetlb_1G' (10 runs):


2,506.50 msec task-clock # 0.995 CPUs utilized ( +- 0.17% )
4 context-switches # 1.589 /sec ( +- 9.28% )
0 cpu-migrations # 0.000 /sec
214 page-faults # 84.997 /sec ( +- 0.13% )
7,821,519,053 cycles # 3.107 GHz ( +- 0.17% ) (35.72%)
2,037,744 stalled-cycles-frontend # 0.03% frontend cycles idle ( +- 25.62% ) (35.73%)
6,578,899 stalled-cycles-backend # 0.08% backend cycles idle ( +- 2.65% ) (35.73%)
468,648,780 instructions # 0.06 insn per cycle
# 0.01 stalled cycles per insn ( +- 0.10% ) (35.73%)
116,267,370 branches # 46.179 M/sec ( +- 0.08% ) (35.73%)
111,966 branch-misses # 0.10% of all branches ( +- 2.98% ) (35.72%)
2,294,727,165 L1-dcache-loads # 911.424 M/sec ( +- 0.02% ) (35.71%)
1,076,156,463 L1-dcache-load-misses # 46.88% of all L1-dcache accesses ( +- 0.01% ) (35.70%)
26,093,151 L1-icache-loads # 10.364 M/sec ( +- 0.21% ) (35.71%)
132,944 L1-icache-load-misses # 0.51% of all L1-icache accesses ( +- 0.55% ) (35.70%)
30,925 dTLB-loads # 12.283 K/sec ( +- 5.70% ) (35.71%)
27,437 dTLB-load-misses # 86.22% of all dTLB cache accesses ( +- 1.98% ) (35.70%)
0 iTLB-loads # 0.000 /sec (35.71%)
11 iTLB-load-misses # 62.50% of all iTLB cache accesses ( +-140.21% ) (35.70%)

2.51890 +- 0.00433 seconds time elapsed ( +- 0.17% )

Performance counter stats for 'numactl -m 0 -N 0 x86/map_hugetlb_1G' (10 runs):

1,013.59 msec task-clock # 1.001 CPUs utilized ( +- 0.07% )
2 context-switches # 1.978 /sec ( +- 12.91% )
1 cpu-migrations # 0.989 /sec
213 page-faults # 210.634 /sec ( +- 0.17% )
3,169,391,694 cycles # 3.134 GHz ( +- 0.07% ) (35.53%)
109,925 stalled-cycles-frontend # 0.00% frontend cycles idle ( +- 5.56% ) (35.63%)
950,638,913 stalled-cycles-backend # 30.06% backend cycles idle ( +- 5.06% ) (35.73%)
51,189,571 instructions # 0.02 insn per cycle
# 21.03 stalled cycles per insn ( +- 1.22% ) (35.82%)
9,545,941 branches # 9.440 M/sec ( +- 1.50% ) (35.92%)
86,836 branch-misses # 0.88% of all branches ( +- 3.74% ) (36.00%)
46,109,587 L1-dcache-loads # 45.597 M/sec ( +- 3.92% ) (35.96%)
13,796,172 L1-dcache-load-misses # 41.77% of all L1-dcache accesses ( +- 4.81% ) (35.85%)
1,179,166 L1-icache-loads # 1.166 M/sec ( +- 1.22% ) (35.77%)
21,528 L1-icache-load-misses # 1.90% of all L1-icache accesses ( +- 1.85% ) (35.66%)
14,529 dTLB-loads # 14.368 K/sec ( +- 4.65% ) (35.57%)
8,505 dTLB-load-misses # 67.88% of all dTLB cache accesses ( +- 5.61% ) (35.52%)
0 iTLB-loads # 0.000 /sec (35.52%)
8 iTLB-load-misses # 0.00% of all iTLB cache accesses ( +-267.99% ) (35.52%)

1.012439 +- 0.000723 seconds time elapsed ( +- 0.07% )


Please feel free to carry:

Tested-by: Raghavendra K T <raghavendra.kt@xxxxxxx>

across any minor changes.

Thanks and Regards
- Raghu