Re: [linus:master] [mm] cacded5e42: aim9.brk_test.ops_per_sec -5.0% regression

From: Oliver Sang
Date: Tue Oct 08 2024 - 04:53:09 EST


hi, Lorenzo Stoakes,

Sorry for the late reply; we were on holiday last week.

On Mon, Sep 30, 2024 at 09:21:52AM +0100, Lorenzo Stoakes wrote:
> On Mon, Sep 30, 2024 at 10:21:27AM GMT, kernel test robot wrote:
> >
> >
> > Hello,
> >
> > kernel test robot noticed a -5.0% regression of aim9.brk_test.ops_per_sec on:
> >
> >
> > commit: cacded5e42b9609b07b22d80c10f0076d439f7d1 ("mm: avoid using vma_merge() for new VMAs")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > testcase: aim9
> > test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 64G memory
>
> Hm, quite an old microarchitecture no?
>
> Would it be possible to try this on a range of uarch's, especially more
> recent ones, with some repeated runs to rule out statistical noise? Much
> appreciated!

We ran this test on the platforms below and observed a similar regression.
One thing I want to mention: for performance tests we run each commit at least
6 times. For this aim9 test the data is quite stable, so there is no %stddev
value in our table; we don't show that value when it is below 2%.

(1)

model: Granite Rapids
nr_node: 1
nr_cpu: 240
memory: 192G

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-gnr-1ap1/brk_test/aim9/300s

fc21959f74bc1138 cacded5e42b9609b07b22d80c10
---------------- ---------------------------
%stddev %change %stddev
\ | \
3220697 -6.0% 3028867 aim9.brk_test.ops_per_sec


(2)

model: Emerald Rapids
nr_node: 4
nr_cpu: 256
memory: 256G
brand: INTEL(R) XEON(R) PLATINUM 8592+

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-emr-2sp1/brk_test/aim9/300s

fc21959f74bc1138 cacded5e42b9609b07b22d80c10
---------------- ---------------------------
%stddev %change %stddev
\ | \
3669298 -6.5% 3430070 aim9.brk_test.ops_per_sec


(3)

model: Sapphire Rapids
nr_node: 2
nr_cpu: 224
memory: 512G
brand: Intel(R) Xeon(R) Platinum 8480CTDX

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/brk_test/aim9/300s

fc21959f74bc1138 cacded5e42b9609b07b22d80c10
---------------- ---------------------------
%stddev %change %stddev
\ | \
3540976 -6.4% 3314159 aim9.brk_test.ops_per_sec


(4)

model: Ice Lake
nr_node: 2
nr_cpu: 64
memory: 256G
brand: Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-icl-2sp9/brk_test/aim9/300s

fc21959f74bc1138 cacded5e42b9609b07b22d80c10
---------------- ---------------------------
%stddev %change %stddev
\ | \
2667734 -5.6% 2518021 aim9.brk_test.ops_per_sec

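For context, the workload here is the aim9 brk_test, which repeatedly grows and
shrinks the process data segment via brk()/sbrk() and reports operations per
second. Below is a minimal, hypothetical sketch of that kind of loop (not the
actual aim9 source; the 64 KiB step and iteration count are arbitrary), just to
illustrate what the regressed path looks like from userspace:

/*
 * Minimal sketch of a brk()-style microbenchmark, loosely modelled on
 * what aim9 brk_test exercises.  NOT the aim9 source; step size and
 * iteration count are arbitrary.
 */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define STEP   (64 * 1024)	/* grow/shrink the heap by 64 KiB */
#define ROUNDS 1000000L

int main(void)
{
	struct timespec t0, t1;
	long i;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ROUNDS; i++) {
		/*
		 * Each round extends and then trims the data segment, so
		 * every iteration goes through the kernel's brk() expand
		 * path (do_brk_flags() and friends).
		 */
		if (sbrk(STEP) == (void *)-1)
			return 1;
		if (sbrk(-STEP) == (void *)-1)
			return 1;
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	double secs = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
	printf("%.0f ops/sec\n", 2.0 * ROUNDS / secs);
	return 0;
}

Each iteration above counts as two brk operations (one expand, one shrink),
which is roughly the shape of loop whose ops/sec figure is reported in the
tables above.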

>
> > parameters:
> >
> > testtime: 300s
> > test: brk_test
> > cpufreq_governor: performance
> >
> >
> >
> >
> > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > the same patch/commit), kindly add following tags
> > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > | Closes: https://lore.kernel.org/oe-lkp/202409301043.629bea78-oliver.sang@xxxxxxxxx
> >
> >
> > Details are as below:
> > -------------------------------------------------------------------------------------------------->
> >
> >
> > The kernel config and materials to reproduce are available at:
> > https://download.01.org/0day-ci/archive/20240930/202409301043.629bea78-oliver.sang@xxxxxxxxx
> >
> > =========================================================================================
> > compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> > gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-ivb-2ep2/brk_test/aim9/300s
> >
> > commit:
> > fc21959f74 ("mm: abstract vma_expand() to use vma_merge_struct")
> > cacded5e42 ("mm: avoid using vma_merge() for new VMAs")
>
> Yup this results in a different code path for brk(), but local testing
> indicated no regression (a prior revision of the series had encountered
> one, so I carefully assessed this, found the bug, and noted no clear
> regression after this - but a lot of variance in the numbers).
>
> >
> > fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> > ---------------- ---------------------------
> > %stddev %change %stddev
> > \ | \
> > 1322908 -5.0% 1256536 aim9.brk_test.ops_per_sec
>
> Unfortunate there's no stddev figure here, and 5% feels borderline on noise
> - as above it'd be great to get some multiple runs going to rule out
> noise. Thanks!

As mentioned above, the reason there is no %stddev here is that it is below 2%.

The raw data is listed below FYI.

for cacded5e42b9609b07b22d80c10

"aim9.brk_test.ops_per_sec": [
1268030.0,
1277110.76,
1226452.45,
1275850.0,
1249628.35,
1242148.6
],


for fc21959f74bc1138

"aim9.brk_test.ops_per_sec": [
1351624.95,
1316322.79,
1330363.33,
1289563.33,
1314100.0,
1335475.48
],
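
For what it's worth, here is a quick illustrative check of the "<2%" point
using the raw samples above (a minimal sketch only; the lkp tooling may compute
%stddev differently, e.g. sample vs. population standard deviation):

/*
 * Quick sanity check of the "<2% stddev" point using the raw samples
 * listed above.  Illustrative only; not part of the lkp tooling.
 */
#include <math.h>
#include <stdio.h>

static void report(const char *tag, const double *v, int n)
{
	double mean = 0.0, var = 0.0;
	int i;

	for (i = 0; i < n; i++)
		mean += v[i];
	mean /= n;

	for (i = 0; i < n; i++)
		var += (v[i] - mean) * (v[i] - mean);
	var /= n;		/* population variance */

	printf("%s: mean %.0f, relative stddev %.1f%%\n",
	       tag, mean, 100.0 * sqrt(var) / mean);
}

int main(void)
{
	const double after[]  = { 1268030.0, 1277110.76, 1226452.45,
				  1275850.0, 1249628.35, 1242148.6 };
	const double before[] = { 1351624.95, 1316322.79, 1330363.33,
				  1289563.33, 1314100.0, 1335475.48 };

	report("cacded5e42 (after) ", after, 6);
	report("fc21959f74 (before)", before, 6);
	return 0;
}

Computed this way, both sets of runs come out at roughly 1.5% relative
standard deviation, i.e. below the 2% threshold, which is why no %stddev
column appears in the table, while the ~5% drop in the means is well outside
that spread.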


>
> > 201.54 +2.9% 207.44 aim9.time.system_time
> > 97.58 -6.0% 91.75 aim9.time.user_time
> > 0.04 ± 82% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > 0.10 ± 60% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > 0.04 ± 82% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > 0.10 ± 60% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > 8.33e+08 +3.9% 8.654e+08 perf-stat.i.branch-instructions
> > 1.15 -0.1 1.09 perf-stat.i.branch-miss-rate%
> > 12964626 -1.9% 12711922 perf-stat.i.branch-misses
> > 1.11 -7.4% 1.03 perf-stat.i.cpi
> > 3.943e+09 +6.0% 4.18e+09 perf-stat.i.instructions
> > 0.91 +7.9% 0.98 perf-stat.i.ipc
> > 0.29 ± 2% -9.1% 0.27 ± 4% perf-stat.overall.MPKI
> > 1.56 -0.1 1.47 perf-stat.overall.branch-miss-rate%
> > 1.08 -6.8% 1.01 perf-stat.overall.cpi
> > 0.92 +7.2% 0.99 perf-stat.overall.ipc
> > 8.303e+08 +3.9% 8.627e+08 perf-stat.ps.branch-instructions
> > 12931205 -2.0% 12678170 perf-stat.ps.branch-misses
> > 3.93e+09 +6.0% 4.167e+09 perf-stat.ps.instructions
> > 1.184e+12 +6.1% 1.256e+12 perf-stat.total.instructions
> > 7.16 ± 2% -0.4 6.76 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.brk
> > 5.72 ± 2% -0.4 5.35 ± 3% perf-profile.calltrace.cycles-pp.perf_event_mmap_event.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64
> > 6.13 ± 2% -0.3 5.84 ± 3% perf-profile.calltrace.cycles-pp.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 0.83 ± 11% -0.1 0.71 ± 5% perf-profile.calltrace.cycles-pp.__vm_enough_memory.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 0.00 +0.6 0.58 ± 5% perf-profile.calltrace.cycles-pp.mas_leaf_max_gap.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range
> > 16.73 ± 2% +0.6 17.34 perf-profile.calltrace.cycles-pp.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> > 0.00 +0.7 0.66 ± 6% perf-profile.calltrace.cycles-pp.mas_wr_store_type.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags
> > 24.21 +0.7 24.90 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> > 23.33 +0.7 24.05 ± 2% perf-profile.calltrace.cycles-pp.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> > 0.00 +0.8 0.82 ± 4% perf-profile.calltrace.cycles-pp.vma_complete.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > 0.00 +0.9 0.87 ± 5% perf-profile.calltrace.cycles-pp.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags
> > 0.00 +1.1 1.07 ± 9% perf-profile.calltrace.cycles-pp.vma_prepare.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > 0.00 +1.1 1.10 ± 6% perf-profile.calltrace.cycles-pp.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > 0.00 +2.3 2.26 ± 5% perf-profile.calltrace.cycles-pp.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > 0.00 +7.6 7.56 ± 3% perf-profile.calltrace.cycles-pp.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64
> > 0.00 +8.6 8.62 ± 4% perf-profile.calltrace.cycles-pp.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > 7.74 ± 2% -0.4 7.30 ± 4% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
> > 5.81 ± 2% -0.4 5.43 ± 3% perf-profile.children.cycles-pp.perf_event_mmap_event
> > 6.18 ± 2% -0.3 5.88 ± 3% perf-profile.children.cycles-pp.perf_event_mmap
> > 3.93 -0.2 3.73 ± 3% perf-profile.children.cycles-pp.perf_iterate_sb
> > 0.22 ± 29% -0.1 0.08 ± 17% perf-profile.children.cycles-pp.may_expand_vm
> > 0.96 ± 3% -0.1 0.83 ± 4% perf-profile.children.cycles-pp.vma_complete
> > 0.61 ± 14% -0.1 0.52 ± 7% perf-profile.children.cycles-pp.percpu_counter_add_batch
> > 0.15 ± 7% -0.1 0.08 ± 20% perf-profile.children.cycles-pp.brk_test
> > 0.08 ± 11% +0.0 0.12 ± 14% perf-profile.children.cycles-pp.mas_prev_setup
> > 0.17 ± 12% +0.1 0.27 ± 10% perf-profile.children.cycles-pp.mas_wr_store_entry
> > 0.00 +0.2 0.15 ± 11% perf-profile.children.cycles-pp.mas_next_range
> > 0.19 ± 8% +0.2 0.38 ± 10% perf-profile.children.cycles-pp.mas_next_slot
> > 0.34 ± 17% +0.3 0.64 ± 6% perf-profile.children.cycles-pp.mas_prev_slot
> > 23.40 +0.7 24.12 ± 2% perf-profile.children.cycles-pp.__do_sys_brk
> > 0.00 +7.6 7.59 ± 3% perf-profile.children.cycles-pp.vma_expand
> > 0.00 +8.7 8.66 ± 4% perf-profile.children.cycles-pp.vma_merge_new_range
> > 1.61 ± 10% -0.9 0.69 ± 8% perf-profile.self.cycles-pp.do_brk_flags
> > 7.64 ± 2% -0.4 7.20 ± 4% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
> > 0.22 ± 30% -0.1 0.08 ± 17% perf-profile.self.cycles-pp.may_expand_vm
> > 0.57 ± 15% -0.1 0.46 ± 6% perf-profile.self.cycles-pp.percpu_counter_add_batch
> > 0.15 ± 7% -0.1 0.08 ± 20% perf-profile.self.cycles-pp.brk_test
> > 0.20 ± 5% -0.0 0.18 ± 4% perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
> > 0.07 ± 18% +0.0 0.10 ± 18% perf-profile.self.cycles-pp.mas_prev_setup
> > 0.00 +0.1 0.09 ± 12% perf-profile.self.cycles-pp.mas_next_range
> > 0.36 ± 8% +0.1 0.45 ± 6% perf-profile.self.cycles-pp.perf_event_mmap
> > 0.15 ± 13% +0.1 0.25 ± 14% perf-profile.self.cycles-pp.mas_wr_store_entry
> > 0.17 ± 11% +0.2 0.37 ± 11% perf-profile.self.cycles-pp.mas_next_slot
> > 0.34 ± 17% +0.3 0.64 ± 6% perf-profile.self.cycles-pp.mas_prev_slot
> > 0.00 +0.3 0.33 ± 5% perf-profile.self.cycles-pp.vma_merge_new_range
> > 0.00 +0.8 0.81 ± 9% perf-profile.self.cycles-pp.vma_expand
> >
> >
> >
> >
> > Disclaimer:
> > Results have been estimated based on internal Intel analysis and are provided
> > for informational purposes only. Any difference in system hardware or software
> > design or configuration may affect actual performance.
> >
> >
> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
> >
>
> Overall, previously we special-cased brk() to avoid regression, but the
> special-casing is horribly duplicative and bug-prone so, while we can
> revert to doing that again, I'd really, really like to avoid it if we
> possibly can :)