Re: [linus:master] [mm] cacded5e42: aim9.brk_test.ops_per_sec -5.0% regression

From: Lorenzo Stoakes
Date: Tue Oct 08 2024 - 04:56:35 EST


On Tue, Oct 08, 2024 at 04:31:59PM +0800, Oliver Sang wrote:
> hi, Lorenzo Stoakes,
>
> sorry for the late reply, we were on holiday last week.
>
> On Mon, Sep 30, 2024 at 09:21:52AM +0100, Lorenzo Stoakes wrote:
> > On Mon, Sep 30, 2024 at 10:21:27AM GMT, kernel test robot wrote:
> > >
> > >
> > > Hello,
> > >
> > > kernel test robot noticed a -5.0% regression of aim9.brk_test.ops_per_sec on:
> > >
> > >
> > > commit: cacded5e42b9609b07b22d80c10f0076d439f7d1 ("mm: avoid using vma_merge() for new VMAs")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > testcase: aim9
> > > test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 64G memory
> >
> > Hm, quite an old microarchitecture no?
> >
> > Would it be possible to try this on a range of uarch's, especially more
> > recent ones, with some repeated runs to rule out statistical noise? Much
> > appreciated!
>
> we ran this test on the platforms below, and observed a similar regression.
> one thing I want to mention is that for performance tests, we run each commit
> at least 6 times. for this aim9 test, the data is quite stable, so there is no
> %stddev value in our table. we don't show this value if it's <2%

Thanks, though going forward I'd suggest either including the %stddev figure
even when it's <2%, or stating explicitly that it was omitted for that
reason. I found its absence quite misleading.

Also, might I suggest reporting the most recent uarch first? The report
appearing to be Ivy Bridge-only delayed my response (not to sound
ungrateful for the report, which is very useful, but it'd be great if you
guys could test in -next, as this was there for weeks with no apparent
issues).

I will look into this now. If I provide patches, would you be able to test
them on the same boxes? It'd be much appreciated!

Thanks, Lorenzo

>
> (1)
>
> model: Granite Rapids
> nr_node: 1
> nr_cpu: 240
> memory: 192G
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-gnr-1ap1/brk_test/aim9/300s
>
> fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>    3220697            -6.0%    3028867        aim9.brk_test.ops_per_sec
>
>
> (2)
>
> model: Emerald Rapids
> nr_node: 4
> nr_cpu: 256
> memory: 256G
> brand: INTEL(R) XEON(R) PLATINUM 8592+
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-emr-2sp1/brk_test/aim9/300s
>
> fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>    3669298            -6.5%    3430070        aim9.brk_test.ops_per_sec
>
>
> (3)
>
> model: Sapphire Rapids
> nr_node: 2
> nr_cpu: 224
> memory: 512G
> brand: Intel(R) Xeon(R) Platinum 8480CTDX
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-spr-2sp4/brk_test/aim9/300s
>
> fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>    3540976            -6.4%    3314159        aim9.brk_test.ops_per_sec
>
>
> (4)
>
> model: Ice Lake
> nr_node: 2
> nr_cpu: 64
> memory: 256G
> brand: Intel(R) Xeon(R) Gold 6346 CPU @ 3.10GHz
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-icl-2sp9/brk_test/aim9/300s
>
> fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> ---------------- ---------------------------
>          %stddev     %change         %stddev
>              \          |                \
>    2667734            -5.6%    2518021        aim9.brk_test.ops_per_sec
>
>
> >
> > > parameters:
> > >
> > > testtime: 300s
> > > test: brk_test
> > > cpufreq_governor: performance
> > >
> > >
> > >
> > >
> > > If you fix the issue in a separate patch/commit (i.e. not just a new version of
> > > the same patch/commit), kindly add following tags
> > > | Reported-by: kernel test robot <oliver.sang@xxxxxxxxx>
> > > | Closes: https://lore.kernel.org/oe-lkp/202409301043.629bea78-oliver.sang@xxxxxxxxx
> > >
> > >
> > > Details are as below:
> > > -------------------------------------------------------------------------------------------------->
> > >
> > >
> > > The kernel config and materials to reproduce are available at:
> > > https://download.01.org/0day-ci/archive/20240930/202409301043.629bea78-oliver.sang@xxxxxxxxx
> > >
> > > =========================================================================================
> > > compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
> > > gcc-12/performance/x86_64-rhel-8.3/debian-12-x86_64-20240206.cgz/lkp-ivb-2ep2/brk_test/aim9/300s
> > >
> > > commit:
> > > fc21959f74 ("mm: abstract vma_expand() to use vma_merge_struct")
> > > cacded5e42 ("mm: avoid using vma_merge() for new VMAs")
> >
> > Yup this results in a different code path for brk(), but local testing
> > indicated no regression (a prior revision of the series had encountered
> > one, so I carefully assessed this, found the bug, and noted no clear
> > regression after this - but a lot of variance in the numbers).
> >
> > >
> > > fc21959f74bc1138 cacded5e42b9609b07b22d80c10
> > > ---------------- ---------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 1322908 -5.0% 1256536 aim9.brk_test.ops_per_sec
> >
> > Unfortunate there's no stddev figure here, and 5% feels borderline on noise
> > - as above it'd be great to get some multiple runs going to rule out
> > noise. Thanks!
>
> as mentioned above, the reason there is no %stddev here is that it's <2%
>
> the raw data is listed below, FYI.
>
> for cacded5e42b9609b07b22d80c10
>
> "aim9.brk_test.ops_per_sec": [
> 1268030.0,
> 1277110.76,
> 1226452.45,
> 1275850.0,
> 1249628.35,
> 1242148.6
> ],
>
>
> for fc21959f74bc1138
>
> "aim9.brk_test.ops_per_sec": [
> 1351624.95,
> 1316322.79,
> 1330363.33,
> 1289563.33,
> 1314100.0,
> 1335475.48
> ],
>
>
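
A quick sanity check on the raw figures above: the sketch below recomputes
the per-commit mean, the relative standard deviation and the resulting
%change. The variable and function names are mine, and using the sample
(n-1) standard deviation is an assumption, since the report does not say
how LKP derives %stddev.

```python
import statistics

# Raw aim9.brk_test.ops_per_sec samples quoted above, six runs per commit.
# base = fc21959f74bc1138, new = cacded5e42b9609b07b22d80c10
base = [1351624.95, 1316322.79, 1330363.33, 1289563.33, 1314100.0, 1335475.48]
new = [1268030.0, 1277110.76, 1226452.45, 1275850.0, 1249628.35, 1242148.6]


def rel_stddev_pct(samples):
    """Sample standard deviation as a percentage of the mean (assumed definition)."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


base_mean = statistics.mean(base)
new_mean = statistics.mean(new)
change_pct = 100.0 * (new_mean - base_mean) / base_mean

print(f"base: mean={base_mean:.0f} stddev={rel_stddev_pct(base):.2f}%")
print(f"new:  mean={new_mean:.0f} stddev={rel_stddev_pct(new):.2f}%")
print(f"change: {change_pct:+.1f}%")
```

Under those assumptions both commits come out below the 2% threshold and
the change rounds to -5.0%, consistent with the Ivy Bridge table.
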
> >
> > > 201.54 +2.9% 207.44 aim9.time.system_time
> > > 97.58 -6.0% 91.75 aim9.time.user_time
> > > 0.04 ± 82% -100.0% 0.00 perf-sched.sch_delay.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > > 0.10 ± 60% -100.0% 0.00 perf-sched.sch_delay.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > > 0.04 ± 82% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > > 0.10 ± 60% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.down_write.do_brk_flags.__do_sys_brk.do_syscall_64
> > > 8.33e+08 +3.9% 8.654e+08 perf-stat.i.branch-instructions
> > > 1.15 -0.1 1.09 perf-stat.i.branch-miss-rate%
> > > 12964626 -1.9% 12711922 perf-stat.i.branch-misses
> > > 1.11 -7.4% 1.03 perf-stat.i.cpi
> > > 3.943e+09 +6.0% 4.18e+09 perf-stat.i.instructions
> > > 0.91 +7.9% 0.98 perf-stat.i.ipc
> > > 0.29 ± 2% -9.1% 0.27 ± 4% perf-stat.overall.MPKI
> > > 1.56 -0.1 1.47 perf-stat.overall.branch-miss-rate%
> > > 1.08 -6.8% 1.01 perf-stat.overall.cpi
> > > 0.92 +7.2% 0.99 perf-stat.overall.ipc
> > > 8.303e+08 +3.9% 8.627e+08 perf-stat.ps.branch-instructions
> > > 12931205 -2.0% 12678170 perf-stat.ps.branch-misses
> > > 3.93e+09 +6.0% 4.167e+09 perf-stat.ps.instructions
> > > 1.184e+12 +6.1% 1.256e+12 perf-stat.total.instructions
> > > 7.16 ± 2% -0.4 6.76 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSRETQ_unsafe_stack.brk
> > > 5.72 ± 2% -0.4 5.35 ± 3% perf-profile.calltrace.cycles-pp.perf_event_mmap_event.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64
> > > 6.13 ± 2% -0.3 5.84 ± 3% perf-profile.calltrace.cycles-pp.perf_event_mmap.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 0.83 ± 11% -0.1 0.71 ± 5% perf-profile.calltrace.cycles-pp.__vm_enough_memory.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 0.00 +0.6 0.58 ± 5% perf-profile.calltrace.cycles-pp.mas_leaf_max_gap.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range
> > > 16.73 ± 2% +0.6 17.34 perf-profile.calltrace.cycles-pp.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> > > 0.00 +0.7 0.66 ± 6% perf-profile.calltrace.cycles-pp.mas_wr_store_type.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags
> > > 24.21 +0.7 24.90 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> > > 23.33 +0.7 24.05 ± 2% perf-profile.calltrace.cycles-pp.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe.brk
> > > 0.00 +0.8 0.82 ± 4% perf-profile.calltrace.cycles-pp.vma_complete.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > > 0.00 +0.9 0.87 ± 5% perf-profile.calltrace.cycles-pp.mas_update_gap.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags
> > > 0.00 +1.1 1.07 ± 9% perf-profile.calltrace.cycles-pp.vma_prepare.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > > 0.00 +1.1 1.10 ± 6% perf-profile.calltrace.cycles-pp.mas_preallocate.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > > 0.00 +2.3 2.26 ± 5% perf-profile.calltrace.cycles-pp.mas_store_prealloc.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk
> > > 0.00 +7.6 7.56 ± 3% perf-profile.calltrace.cycles-pp.vma_expand.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64
> > > 0.00 +8.6 8.62 ± 4% perf-profile.calltrace.cycles-pp.vma_merge_new_range.do_brk_flags.__do_sys_brk.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 7.74 ± 2% -0.4 7.30 ± 4% perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
> > > 5.81 ± 2% -0.4 5.43 ± 3% perf-profile.children.cycles-pp.perf_event_mmap_event
> > > 6.18 ± 2% -0.3 5.88 ± 3% perf-profile.children.cycles-pp.perf_event_mmap
> > > 3.93 -0.2 3.73 ± 3% perf-profile.children.cycles-pp.perf_iterate_sb
> > > 0.22 ± 29% -0.1 0.08 ± 17% perf-profile.children.cycles-pp.may_expand_vm
> > > 0.96 ± 3% -0.1 0.83 ± 4% perf-profile.children.cycles-pp.vma_complete
> > > 0.61 ± 14% -0.1 0.52 ± 7% perf-profile.children.cycles-pp.percpu_counter_add_batch
> > > 0.15 ± 7% -0.1 0.08 ± 20% perf-profile.children.cycles-pp.brk_test
> > > 0.08 ± 11% +0.0 0.12 ± 14% perf-profile.children.cycles-pp.mas_prev_setup
> > > 0.17 ± 12% +0.1 0.27 ± 10% perf-profile.children.cycles-pp.mas_wr_store_entry
> > > 0.00 +0.2 0.15 ± 11% perf-profile.children.cycles-pp.mas_next_range
> > > 0.19 ± 8% +0.2 0.38 ± 10% perf-profile.children.cycles-pp.mas_next_slot
> > > 0.34 ± 17% +0.3 0.64 ± 6% perf-profile.children.cycles-pp.mas_prev_slot
> > > 23.40 +0.7 24.12 ± 2% perf-profile.children.cycles-pp.__do_sys_brk
> > > 0.00 +7.6 7.59 ± 3% perf-profile.children.cycles-pp.vma_expand
> > > 0.00 +8.7 8.66 ± 4% perf-profile.children.cycles-pp.vma_merge_new_range
> > > 1.61 ± 10% -0.9 0.69 ± 8% perf-profile.self.cycles-pp.do_brk_flags
> > > 7.64 ± 2% -0.4 7.20 ± 4% perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
> > > 0.22 ± 30% -0.1 0.08 ± 17% perf-profile.self.cycles-pp.may_expand_vm
> > > 0.57 ± 15% -0.1 0.46 ± 6% perf-profile.self.cycles-pp.percpu_counter_add_batch
> > > 0.15 ± 7% -0.1 0.08 ± 20% perf-profile.self.cycles-pp.brk_test
> > > 0.20 ± 5% -0.0 0.18 ± 4% perf-profile.self.cycles-pp.anon_vma_interval_tree_insert
> > > 0.07 ± 18% +0.0 0.10 ± 18% perf-profile.self.cycles-pp.mas_prev_setup
> > > 0.00 +0.1 0.09 ± 12% perf-profile.self.cycles-pp.mas_next_range
> > > 0.36 ± 8% +0.1 0.45 ± 6% perf-profile.self.cycles-pp.perf_event_mmap
> > > 0.15 ± 13% +0.1 0.25 ± 14% perf-profile.self.cycles-pp.mas_wr_store_entry
> > > 0.17 ± 11% +0.2 0.37 ± 11% perf-profile.self.cycles-pp.mas_next_slot
> > > 0.34 ± 17% +0.3 0.64 ± 6% perf-profile.self.cycles-pp.mas_prev_slot
> > > 0.00 +0.3 0.33 ± 5% perf-profile.self.cycles-pp.vma_merge_new_range
> > > 0.00 +0.8 0.81 ± 9% perf-profile.self.cycles-pp.vma_expand
> > >
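
Reading the perf-stat rows together with the headline figure: throughput
drops by 5.0% while instructions per second rise by 6.0%, which suggests
each brk() call now executes more instructions. A rough sketch of that
arithmetic follows. It assumes perf-stat.ps.instructions and
aim9.brk_test.ops_per_sec cover the same interval and that nearly all
counted instructions belong to the benchmark; neither is stated in the
report, so treat the result as an estimate only.

```python
# Figures copied from the lkp-ivb-2ep2 table above.
ops_per_sec = {"fc21959f74": 1322908, "cacded5e42": 1256536}
# perf-stat.ps.instructions (instructions per second) for the same commits.
instr_per_sec = {"fc21959f74": 3.93e9, "cacded5e42": 4.167e9}

# Estimated instructions executed per brk_test operation.
instr_per_op = {c: instr_per_sec[c] / ops_per_sec[c] for c in ops_per_sec}

increase_pct = 100.0 * (
    instr_per_op["cacded5e42"] / instr_per_op["fc21959f74"] - 1.0
)

for commit, value in instr_per_op.items():
    print(f"{commit}: ~{value:.0f} instructions per op")
print(f"increase: {increase_pct:+.1f}%")
```

An increase of that order would be consistent with the
vma_merge_new_range() and vma_expand() entries that appear only in the new
commit's profile.
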
> > >
> > >
> > >
> > > Disclaimer:
> > > Results have been estimated based on internal Intel analysis and are provided
> > > for informational purposes only. Any difference in system hardware or software
> > > design or configuration may affect actual performance.
> > >
> > >
> > > --
> > > 0-DAY CI Kernel Test Service
> > > https://github.com/intel/lkp-tests/wiki
> > >
> >
> > Overall, previously we special-cased brk() to avoid regression, but the
> > special-casing is horribly duplicative and bug-prone so, while we can
> > revert to doing that again, I'd really, really like to avoid it if we
> > possibly can :)