Re: [mm/contpte v3 0/1] mm/contpte: Optimize loop to reduce redundant operations

From: Andrew Morton
Date: Tue Apr 15 2025 - 22:10:35 EST



Please try to avoid presentation of a [0/N] cover letter when N==1! A
simple singleton patch is better.

On Tue, 15 Apr 2025 16:22:04 +0800 Xavier <xavier_qy@xxxxxxx> wrote:

> Patch V3 has changed the while loop to a for loop according to the suggestions
> of Dev. Meanwhile, to improve efficiency, the definition of local variables has
> been removed. This macro is only used within the current function and there

which function?

> will be no additional risks. In order to verify the optimization performance of
> Patch V3, a test function has been designed. By repeatedly calling mlock in a
> loop, the kernel is made to call contpte_ptep_get extensively to test the
> optimization effect of this function.
> The function's execution time and instruction statistics have been traced using
> perf, and the following are the operation results on a certain Qualcomm mobile
> phone chip:

All the words thus far appear to be discussing changes since v2. For
the permanent kernel record, this isn't interesting or useful material.
So please present a standalone description which doesn't refer to
previous iterations.

It's great to present this what-i-changed-since-last-time material, but
that is better placed after the "^---$" separator, after the
Signed-off-by:, Reviewed-by: etc tags.

>
> ...
>


Below is what I came up with for a changelog. Please check?

Optimize contpte_ptep_get() by adding early termination logic. Check if
the dirty and young bits of orig_pte are already set and skip redundant
bit-setting operations during the loop. This reduces unnecessary
iterations and improves performance.

The function's execution time and instruction statistics have been traced
using perf, and the following are the operation results on a certain
Qualcomm mobile phone chip:

Instruction Statistics - Before Optimization
# count event_name # count / runtime
20,814,352 branch-load-misses # 662.244 K/sec
41,894,986,323 branch-loads # 1.333 G/sec
1,957,415 iTLB-load-misses # 62.278 K/sec
49,872,282,100 iTLB-loads # 1.587 G/sec
302,808,096 L1-icache-load-misses # 9.634 M/sec
49,872,282,100 L1-icache-loads # 1.587 G/sec

Total test time: 31.485237 seconds.

Instruction Statistics - After Optimization
# count event_name # count / runtime
19,340,524 branch-load-misses # 688.753 K/sec
38,510,185,183 branch-loads # 1.371 G/sec
1,812,716 iTLB-load-misses # 64.554 K/sec
47,673,923,151 iTLB-loads # 1.698 G/sec
675,853,661 L1-icache-load-misses # 24.068 M/sec
47,673,923,151 L1-icache-loads # 1.698 G/sec

Total test time: 28.108048 seconds.

Function Statistics - Before Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1419716
Event count: 99618088900

Overhead Symbol
21.42% lock_release
21.26% lock_acquire
20.88% arch_counter_get_cntvct
14.32% _raw_spin_unlock_irq
6.79% contpte_ptep_get
2.20% test_contpte_perf
1.82% follow_page_pte
0.97% lock_acquired
0.97% rcu_is_watching
0.89% mlock_pte_range
0.84% sched_clock_noinstr
0.70% handle_softirqs.llvm.8218488130471452153
0.58% test_preempt_disable_long
0.57% _raw_spin_unlock_irqrestore
0.54% arch_stack_walk
0.51% vm_normal_folio
0.48% check_preemption_disabled
0.47% stackinfo_get_task
0.36% try_grab_folio
0.34% preempt_count
0.32% trace_preempt_on
0.29% trace_preempt_off
0.24% debug_smp_processor_id

Function Statistics - After Optimization
Arch: arm64
Event: cpu-cycles (type 0, config 0)
Samples: 1431006
Event count: 118856425042

Overhead Symbol
22.59% lock_release
22.13% arch_counter_get_cntvct
22.08% lock_acquire
15.32% _raw_spin_unlock_irq
2.26% test_contpte_perf
1.50% follow_page_pte
1.49% arch_stack_walk
1.30% rcu_is_watching
1.09% lock_acquired
1.07% sched_clock_noinstr
0.88% handle_softirqs.llvm.12507768597002095717
0.88% trace_preempt_off
0.76% _raw_spin_unlock_irqrestore
0.61% check_preemption_disabled
0.52% trace_preempt_on
0.50% mlock_pte_range
0.43% try_grab_folio
0.41% folio_mark_accessed
0.40% vm_normal_folio
0.38% test_preempt_disable_long
0.28% contpte_ptep_get
0.27% __traceiter_android_rvh_preempt_disable
0.26% debug_smp_processor_id
0.24% return_address
0.20% __pte_offset_map_lock
0.19% unwind_next_frame_record

If there is no problem with my test program, it can be seen that there is a
significant performance improvement both in the overall number of instructions
and the execution time of contpte_ptep_get.

If any reviewers have time, you can also test it on your machines for comparison.
I have enabled THP and hugepages-64kB.

Test function:

#define PAGE_SIZE 4096
#define CONT_PTES 16
#define TEST_SIZE (4096* CONT_PTES * PAGE_SIZE)

void rwdata(char *buf)
{
for (size_t i = 0; i < TEST_SIZE; i += PAGE_SIZE) {
buf[i] = 'a';
volatile char c = buf[i];
}
}
void test_contpte_perf()
{
char *buf;
int ret = posix_memalign((void **)&buf, PAGE_SIZE, TEST_SIZE);
if (ret != 0) {
perror("posix_memalign failed");
exit(EXIT_FAILURE);
}

rwdata(buf);

for (int j = 0; j < 500; j++) {
mlock(buf, TEST_SIZE);

rwdata(buf);

munlock(buf, TEST_SIZE);
}

free(buf);
}