[linux-next:master] 253ca8678d: lmbench3.Select.100tcp.latency.us -5.0% improvement

From: kernel test robot
Date: Thu Dec 21 2023 - 21:54:05 EST




Hello,

this commit fixes the
"[linus:master] [file] 0ede61d858: will-it-scale.per_thread_ops -2.9% regression"
we reported in
https://lore.kernel.org/oe-lkp/202311201406.2022ca3f-oliver.sang@xxxxxxxxx/

in our tests, besides the improvement in the will-it-scale tests, we also
noticed an improvement in the lmbench3 latency tests, so we are reporting it
below FYI.



kernel test robot noticed a -5.0% improvement of lmbench3.Select.100tcp.latency.us on:


commit: 253ca8678d30bcf94410b54476fc1e0f1627a137 ("Improve __fget_files_rcu() code generation (and thus __fget_light())")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

testcase: lmbench3
test machine: 48 threads 2 sockets Intel(R) Xeon(R) CPU E5-2697 v2 @ 2.70GHz (Ivy Bridge-EP) with 112G memory
parameters:

test_memory_size: 50%
nr_threads: 50%
mode: development
test: SELECT
cpufreq_governor: performance


In addition to that, the commit also has a significant impact on the following test:

+------------------+----------------------------------------------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_process_ops 10.3% improvement                                     |
| test machine     | 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory |
| test parameters  | cpufreq_governor=performance                                                                       |
|                  | mode=process                                                                                       |
|                  | nr_task=100%                                                                                       |
|                  | test=poll2                                                                                         |
+------------------+----------------------------------------------------------------------------------------------------+
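
For orientation: both workloads come down to a tight loop calling select(2) or
poll(2) on a fixed set of descriptors, so every call performs one fd lookup per
descriptor (the __fget_light/__fdget path in the profiles below). The following
is a minimal Python sketch of that pattern. It is not the lmbench3 or
will-it-scale source; the use of pipes, the descriptor count of 100, and the
helper names make_fds, time_calls and run are assumptions made for
illustration, and interpreter overhead will dominate the absolute numbers.

```python
import os
import select
import time


def make_fds(n):
    """Open n pipes; pipes are a stand-in for the files/sockets the real tests use."""
    read_fds, all_fds = [], []
    for _ in range(n):
        r, w = os.pipe()
        read_fds.append(r)
        all_fds.extend((r, w))
    return read_fds, all_fds


def time_calls(fn, iterations):
    """Return the mean wall-clock time per call of fn, in microseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations * 1e6


def run(n_fds=100, iterations=10000):
    """Time zero-timeout select() and poll() over n_fds idle descriptors."""
    read_fds, all_fds = make_fds(n_fds)
    try:
        poller = select.poll()
        for fd in read_fds:
            poller.register(fd, select.POLLIN)
        # A timeout of 0 makes each call return immediately, so the loop
        # mostly measures per-descriptor lookup and syscall overhead.
        select_us = time_calls(lambda: select.select(read_fds, [], [], 0), iterations)
        poll_us = time_calls(lambda: poller.poll(0), iterations)
    finally:
        for fd in all_fds:
            os.close(fd)
    return select_us, poll_us


if __name__ == "__main__":
    select_us, poll_us = run()
    print(f"select: {select_us:.2f} us/call, poll: {poll_us:.2f} us/call")
```

Comparing the two numbers across kernels built from the parent and the patched
commit would give a rough, noisy analogue of the measurements in this report.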




Details are as below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20231222/202312221056.da0e7f9-oliver.sang@xxxxxxxxx

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_threads/rootfs/tbox_group/test/test_memory_size/testcase:
gcc-12/performance/x86_64-rhel-8.3/development/50%/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/SELECT/50%/lmbench3

commit:
7cb537b6f6 ("file: massage cleanup of files that failed to open")
253ca8678d ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

7cb537b6f6d7d652 253ca8678d30bcf94410b54476f
---------------- ---------------------------
%stddev %change %stddev
\ | \
1.78 -9.8% 1.61 lmbench3.Select.100fd.latency.us
5.70 -5.0% 5.41 lmbench3.Select.100tcp.latency.us
12.09 ± 36% -12.1 0.00 perf-profile.calltrace.cycles-pp.__fget_light.do_select.core_sys_select.kern_select.__x64_sys_select
0.05 ±299% +14.9 14.97 ± 51% perf-profile.calltrace.cycles-pp.__fdget.do_select.core_sys_select.kern_select.__x64_sys_select
12.09 ± 36% -12.1 0.00 perf-profile.children.cycles-pp.__fget_light
0.36 ± 42% +14.6 14.98 ± 51% perf-profile.children.cycles-pp.__fdget
12.05 ± 36% -12.1 0.00 perf-profile.self.cycles-pp.__fget_light
0.31 ± 42% +14.6 14.91 ± 52% perf-profile.self.cycles-pp.__fdget
0.19 ± 2% +0.0 0.20 ± 3% perf-stat.i.dTLB-store-miss-rate%
1585715 ± 8% +93.4% 3067285 ± 30% perf-stat.i.iTLB-load-misses
0.17 ± 2% +0.0 0.19 ± 3% perf-stat.overall.dTLB-store-miss-rate%
88.15 ± 5% +4.9 93.07 perf-stat.overall.iTLB-load-miss-rate%
48830 ± 8% -45.0% 26871 ± 25% perf-stat.overall.instructions-per-iTLB-miss
1.41 -1.8% 1.38 perf-stat.overall.ipc
1573086 ± 8% +93.7% 3047643 ± 30% perf-stat.ps.iTLB-load-misses
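
The %change column for counters and latencies is a relative change against the
parent commit. A minimal sketch of that arithmetic follows, using values copied
from the rows above; the helper name pct_change is an assumption. Because the
table prints rounded means, a recomputation can differ in the last digit: the
100tcp latency row recomputes to -5.1% against the reported -5.0%, presumably
because the report works from unrounded values.

```python
def pct_change(base, new):
    """Relative change of new versus base, in percent."""
    return (new - base) / base * 100.0


# (parent commit, patched commit) values copied from the table above
rows = {
    "perf-stat.i.iTLB-load-misses": (1585715, 3067285),
    "perf-stat.overall.instructions-per-iTLB-miss": (48830, 26871),
    "lmbench3.Select.100tcp.latency.us": (5.70, 5.41),
}

for name, (base, new) in rows.items():
    print(f"{name}: {pct_change(base, new):+.1f}%")
```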


***************************************************************************************************
lkp-cpl-4sp2: 224 threads 4 sockets Intel(R) Xeon(R) Platinum 8380H CPU @ 2.90GHz (Cooper Lake) with 192G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-12/performance/x86_64-rhel-8.3/process/100%/debian-11.1-x86_64-20220510.cgz/lkp-cpl-4sp2/poll2/will-it-scale

commit:
7cb537b6f6 ("file: massage cleanup of files that failed to open")
253ca8678d ("Improve __fget_files_rcu() code generation (and thus __fget_light())")

7cb537b6f6d7d652 253ca8678d30bcf94410b54476f
---------------- ---------------------------
%stddev %change %stddev
\ | \
685.00 ± 5% +62.3% 1111 ± 13% perf-c2c.HITM.local
0.04 ±187% +482.9% 0.21 ± 50% perf-sched.sch_delay.avg.ms.schedule_hrtimeout_range_clock.do_poll.constprop.0.do_sys_poll
136406 +2.0% 139095 proc-vmstat.nr_active_anon
136406 +2.0% 139095 proc-vmstat.nr_zone_active_anon
98393191 +10.3% 1.085e+08 will-it-scale.224.processes
439254 +10.3% 484377 will-it-scale.per_process_ops
98393191 +10.3% 1.085e+08 will-it-scale.workload
0.00 +28.2% 0.00 ± 17% perf-stat.i.MPKI
2.226e+11 -2.2% 2.178e+11 perf-stat.i.branch-instructions
0.28 +0.0 0.30 perf-stat.i.branch-miss-rate%
6.155e+08 +7.4% 6.608e+08 perf-stat.i.branch-misses
12.91 -3.3 9.62 ± 13% perf-stat.i.cache-miss-rate%
1955843 +22.9% 2402856 ± 17% perf-stat.i.cache-misses
15946481 +59.2% 25391906 ± 9% perf-stat.i.cache-references
0.59 +5.0% 0.62 perf-stat.i.cpi
408471 -17.9% 335390 ± 14% perf-stat.i.cycles-between-cache-misses
2.901e+11 -4.0% 2.784e+11 perf-stat.i.dTLB-loads
0.00 ± 9% +0.0 0.00 ± 10% perf-stat.i.dTLB-store-miss-rate%
1.814e+11 -12.6% 1.585e+11 perf-stat.i.dTLB-stores
26765498 +9.7% 29360826 perf-stat.i.iTLB-load-misses
1.23e+12 -4.4% 1.176e+12 perf-stat.i.instructions
46105 -12.9% 40163 perf-stat.i.instructions-per-iTLB-miss
1.69 -4.8% 1.61 perf-stat.i.ipc
1.30 -4.1% 1.24 perf-stat.i.metric.G/sec
75.67 +56.5% 118.40 ± 9% perf-stat.i.metric.K/sec
1802 -6.9% 1679 perf-stat.i.metric.M/sec
91.19 +1.9 93.14 perf-stat.i.node-load-miss-rate%
603847 +29.4% 781631 ± 13% perf-stat.i.node-load-misses
0.00 ± 44% +54.2% 0.00 ± 17% perf-stat.overall.MPKI
0.23 ± 44% +0.1 0.30 perf-stat.overall.branch-miss-rate%
0.49 ± 44% +26.0% 0.62 perf-stat.overall.cpi
0.00 ± 46% +0.0 0.00 ± 10% perf-stat.overall.dTLB-store-miss-rate%
73.34 ± 44% +18.0 91.29 perf-stat.overall.node-load-miss-rate%
5.111e+08 ± 44% +28.9% 6.586e+08 perf-stat.ps.branch-misses
1626781 ± 44% +47.4% 2397620 ± 17% perf-stat.ps.cache-misses
13269755 ± 44% +91.5% 25415998 ± 9% perf-stat.ps.cache-references
22231799 ± 44% +31.6% 29255242 perf-stat.ps.iTLB-load-misses
501267 ± 44% +55.4% 779219 ± 13% perf-stat.ps.node-load-misses
16030 ± 45% +33.6% 21409 ± 6% perf-stat.ps.node-stores
47.56 -47.6 0.00 perf-profile.calltrace.cycles-pp.__fget_light.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
67.41 -2.9 64.56 perf-profile.calltrace.cycles-pp.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
87.35 -1.2 86.15 perf-profile.calltrace.cycles-pp.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
87.96 -1.1 86.82 perf-profile.calltrace.cycles-pp.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
88.69 -1.1 87.62 perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__poll
89.02 -1.1 87.97 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__poll
91.89 -0.8 91.12 perf-profile.calltrace.cycles-pp.__poll
0.81 +0.0 0.85 perf-profile.calltrace.cycles-pp.__check_heap_object.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64
0.64 +0.1 0.69 ± 2% perf-profile.calltrace.cycles-pp.__kmem_cache_free.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.68 +0.1 0.74 perf-profile.calltrace.cycles-pp.kfree.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.26 +0.1 1.32 perf-profile.calltrace.cycles-pp.check_heap_object.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64
0.84 +0.1 0.94 ± 2% perf-profile.calltrace.cycles-pp.__virt_addr_valid.check_heap_object.__check_object_size.do_sys_poll.__x64_sys_poll
1.53 +0.1 1.67 perf-profile.calltrace.cycles-pp.__kmem_cache_alloc_node.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64
1.82 +0.2 1.98 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64.__poll
2.60 +0.2 2.76 perf-profile.calltrace.cycles-pp.__check_object_size.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.91 +0.2 2.09 perf-profile.calltrace.cycles-pp.__kmalloc.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
2.44 ± 2% +0.2 2.62 perf-profile.calltrace.cycles-pp.rep_movs_alternative._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64
3.86 +0.3 4.20 perf-profile.calltrace.cycles-pp._copy_from_user.do_sys_poll.__x64_sys_poll.do_syscall_64.entry_SYSCALL_64_after_hwframe
7.94 +0.8 8.70 perf-profile.calltrace.cycles-pp.testcase
3.60 +42.4 45.95 perf-profile.calltrace.cycles-pp.__fdget.do_poll.do_sys_poll.__x64_sys_poll.do_syscall_64
45.80 -45.8 0.00 perf-profile.children.cycles-pp.__fget_light
69.22 -2.7 66.50 perf-profile.children.cycles-pp.do_poll
87.48 -1.2 86.29 perf-profile.children.cycles-pp.do_sys_poll
87.99 -1.1 86.85 perf-profile.children.cycles-pp.__x64_sys_poll
88.74 -1.1 87.67 perf-profile.children.cycles-pp.do_syscall_64
89.06 -1.0 88.01 perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
91.99 -0.8 91.23 perf-profile.children.cycles-pp.__poll
0.08 +0.0 0.09 ± 4% perf-profile.children.cycles-pp.is_vmalloc_addr
0.14 ± 2% +0.0 0.16 ± 3% perf-profile.children.cycles-pp.exit_to_user_mode_prepare
0.24 +0.0 0.26 perf-profile.children.cycles-pp.memcg_slab_post_alloc_hook
0.16 ± 3% +0.0 0.17 perf-profile.children.cycles-pp.rcu_all_qs
0.13 ± 3% +0.0 0.14 ± 2% perf-profile.children.cycles-pp.kmalloc_slab
0.12 ± 3% +0.0 0.14 ± 3% perf-profile.children.cycles-pp.syscall_enter_from_user_mode
0.21 ± 2% +0.0 0.24 perf-profile.children.cycles-pp.check_stack_object
0.24 ± 2% +0.0 0.27 perf-profile.children.cycles-pp.poll@plt
0.15 ± 2% +0.0 0.18 ± 2% perf-profile.children.cycles-pp.entry_SYSCALL_64_safe_stack
0.24 ± 2% +0.0 0.26 perf-profile.children.cycles-pp.__cond_resched
0.36 +0.0 0.40 ± 2% perf-profile.children.cycles-pp.syscall_exit_to_user_mode
0.81 +0.0 0.86 perf-profile.children.cycles-pp.__check_heap_object
0.48 +0.0 0.53 perf-profile.children.cycles-pp.syscall_return_via_sysret
0.65 +0.1 0.70 perf-profile.children.cycles-pp.__kmem_cache_free
0.68 +0.1 0.74 perf-profile.children.cycles-pp.kfree
0.70 +0.1 0.76 perf-profile.children.cycles-pp.entry_SYSRETQ_unsafe_stack
1.32 +0.1 1.39 perf-profile.children.cycles-pp.check_heap_object
1.14 +0.1 1.23 perf-profile.children.cycles-pp.entry_SYSCALL_64
0.85 +0.1 0.96 perf-profile.children.cycles-pp.__virt_addr_valid
1.60 +0.1 1.76 perf-profile.children.cycles-pp.__kmem_cache_alloc_node
2.76 +0.2 2.94 perf-profile.children.cycles-pp.__check_object_size
1.94 +0.2 2.13 perf-profile.children.cycles-pp.__kmalloc
2.48 ± 2% +0.2 2.67 perf-profile.children.cycles-pp.rep_movs_alternative
4.09 +0.4 4.45 perf-profile.children.cycles-pp._copy_from_user
8.04 +0.8 8.81 perf-profile.children.cycles-pp.testcase
3.58 +40.5 44.04 perf-profile.children.cycles-pp.__fdget
43.81 -43.8 0.00 perf-profile.self.cycles-pp.__fget_light
0.40 -0.0 0.38 perf-profile.self.cycles-pp.check_heap_object
0.15 +0.0 0.16 perf-profile.self.cycles-pp.poll_select_set_timeout
0.06 +0.0 0.07 perf-profile.self.cycles-pp.is_vmalloc_addr
0.10 ± 4% +0.0 0.12 ± 4% perf-profile.self.cycles-pp.exit_to_user_mode_prepare
0.14 ± 2% +0.0 0.15 ± 2% perf-profile.self.cycles-pp.rcu_all_qs
0.11 ± 4% +0.0 0.13 ± 2% perf-profile.self.cycles-pp.kmalloc_slab
0.11 +0.0 0.12 ± 4% perf-profile.self.cycles-pp.syscall_enter_from_user_mode
0.21 +0.0 0.23 ± 2% perf-profile.self.cycles-pp.memcg_slab_post_alloc_hook
0.14 ± 3% +0.0 0.16 perf-profile.self.cycles-pp.poll@plt
0.18 ± 2% +0.0 0.20 perf-profile.self.cycles-pp.check_stack_object
0.15 ± 2% +0.0 0.17 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_safe_stack
0.22 ± 2% +0.0 0.24 ± 2% perf-profile.self.cycles-pp.__kmalloc
0.32 ± 2% +0.0 0.34 ± 2% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_hwframe
0.25 +0.0 0.28 perf-profile.self.cycles-pp.do_syscall_64
0.43 +0.0 0.47 perf-profile.self.cycles-pp.__check_object_size
0.45 +0.0 0.48 perf-profile.self.cycles-pp.entry_SYSCALL_64
0.36 +0.0 0.40 ± 2% perf-profile.self.cycles-pp.__x64_sys_poll
0.81 +0.0 0.85 perf-profile.self.cycles-pp.__check_heap_object
0.48 +0.0 0.52 perf-profile.self.cycles-pp.syscall_return_via_sysret
0.65 +0.1 0.70 perf-profile.self.cycles-pp.__kmem_cache_free
0.68 +0.1 0.74 perf-profile.self.cycles-pp.entry_SYSRETQ_unsafe_stack
0.66 +0.1 0.72 perf-profile.self.cycles-pp.kfree
0.81 +0.1 0.91 ± 2% perf-profile.self.cycles-pp.__virt_addr_valid
1.05 ± 4% +0.1 1.16 ± 3% perf-profile.self.cycles-pp.__poll
1.13 +0.1 1.24 perf-profile.self.cycles-pp.__kmem_cache_alloc_node
1.73 +0.2 1.90 perf-profile.self.cycles-pp._copy_from_user
2.33 ± 2% +0.2 2.52 perf-profile.self.cycles-pp.rep_movs_alternative
8.10 +0.7 8.80 perf-profile.self.cycles-pp.do_sys_poll
7.94 +0.8 8.69 perf-profile.self.cycles-pp.testcase
23.27 +1.0 24.26 perf-profile.self.cycles-pp.do_poll
1.79 +40.1 41.93 perf-profile.self.cycles-pp.__fdget
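
Unlike the counter rows, the perf-profile rows report the change in absolute
percentage points of cycles. The __fget_light rows drop to 0.00 while the
__fdget rows grow by a similar amount, which would be consistent with the
lookup now being accounted under __fdget (for example if __fget_light is
inlined there); the report itself does not state this, so treat it as an
assumption. Under that assumption, a short sketch that sums the two symbols to
compare the fd lookup cost before and after:

```python
# (parent, patched) share of self cycles in percent, copied from the
# poll2 perf-profile.self rows above
self_cycles = {
    "__fget_light": (43.81, 0.00),
    "__fdget": (1.79, 41.93),
}


def combined(rows):
    """Sum the before/after shares and return (before, after, delta in points)."""
    before = sum(b for b, _ in rows.values())
    after = sum(a for _, a in rows.values())
    return before, after, after - before


before, after, delta = combined(self_cycles)
print(f"fd lookup self cycles: {before:.2f}% -> {after:.2f}% ({delta:+.2f} points)")
# prints: fd lookup self cycles: 45.60% -> 41.93% (-3.67 points)
```

Read this way, the combined share falls by a few points, in line with the
slightly larger shares reported for the other symbols in the poll path.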





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki