Re: [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned

From: kernel test robot
Date: Sun Sep 10 2023 - 11:29:58 EST

Hello,

kernel test robot noticed a 33.6% improvement (a -33.6% change in runtime) in autonuma-benchmark.numa02.seconds on:


commit: af46f3c9ca2d16485912f8b9c896ef48bbfe1388 ("[RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned")
url: https://github.com/intel-lab-lkp/linux/commits/Raghavendra-K-T/sched-numa-Move-up-the-access-pid-reset-logic/20230829-141007
base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git 2f88c8e802c8b128a155976631f4eb2ce4f3c805
patch link: https://lore.kernel.org/all/109ca1ea59b9dd6f2daf7b7fbc74e83ae074fbdf.1693287931.git.raghavendra.kt@xxxxxxx/
patch subject: [RFC PATCH V1 5/6] sched/numa: Allow recently accessed VMAs to be scanned

testcase: autonuma-benchmark
test machine: 128 threads, 2 sockets, Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake), 128G memory
parameters:

  iterations: 4x
  test: numa01_THREAD_ALLOC
  cpufreq_governor: performance



Details are below:
-------------------------------------------------------------------------------------------------->


The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230910/202309102311.84b42068-oliver.sang@xxxxxxxxx

=========================================================================================
compiler/cpufreq_governor/iterations/kconfig/rootfs/tbox_group/test/testcase:
gcc-12/performance/4x/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-icl-2sp6/numa01_THREAD_ALLOC/autonuma-benchmark

commit:
  167773d1dd ("sched/numa: Increase tasks' access history")
  af46f3c9ca ("sched/numa: Allow recently accessed VMAs to be scanned")

167773d1ddb5ffdd af46f3c9ca2d16485912f8b9c89
---------------- ---------------------------
         %stddev    %change          %stddev
             \         |                 \
 2.534e+10 ± 10%     -13.0%  2.204e+10 ±  7%  cpuidle..time
  26431366 ± 10%     -13.2%   22948978 ±  7%  cpuidle..usage
      0.15 ±  4%       -0.0       0.12 ±  3%  mpstat.cpu.all.soft%
      2.92 ±  3%       +0.4       3.32 ±  4%  mpstat.cpu.all.sys%
      2243 ±  2%     -12.7%       1957 ±  3%  uptime.boot
     29811 ±  8%     -11.1%      26507 ±  6%  uptime.idle
      5.32 ± 79%     -64.2%       1.91 ± 60%  perf-sched.sch_delay.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_exc_page_fault
      2.70 ± 18%     +37.8%       3.72 ±  9%  perf-sched.sch_delay.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
      0.64 ±137%  +26644.2%     169.91 ±220%  perf-sched.wait_time.avg.ms.__cond_resched.task_work_run.exit_to_user_mode_loop.exit_to_user_mode_prepare.syscall_exit_to_user_mode
      0.08 ± 20%       +0.0       0.12 ± 10%  perf-profile.children.cycles-pp.terminate_walk
      0.10 ± 25%       +0.0       0.14 ± 10%  perf-profile.children.cycles-pp.wake_up_q
      0.06 ± 50%       +0.0       0.10 ± 10%  perf-profile.children.cycles-pp.vfs_readlink
      0.15 ± 36%       +0.1       0.22 ± 13%  perf-profile.children.cycles-pp.readlink
      1.31 ± 19%       +0.4       1.69 ± 12%  perf-profile.children.cycles-pp.unmap_vmas
      2.46 ± 19%       +0.5       2.99 ±  4%  perf-profile.children.cycles-pp.exit_mmap
    311653 ± 10%     -23.7%     237884 ±  9%  turbostat.C1E
  26018024 ± 10%     -13.1%   22597563 ±  7%  turbostat.C6
      6.41 ±  9%     -13.6%       5.54 ±  8%  turbostat.CPU%c1
      2.47 ± 11%     +36.0%       3.36 ±  6%  turbostat.CPU%c6
 2.881e+08 ±  2%     -12.8%  2.513e+08 ±  3%  turbostat.IRQ
    212.86            +2.8%     218.84        turbostat.RAMWatt
    341.49            -4.1%     327.42 ±  2%  autonuma-benchmark.numa01.seconds
    186.67 ±  6%     -27.1%     136.12 ±  7%  autonuma-benchmark.numa01_THREAD_ALLOC.seconds
     21.17 ±  7%     -33.6%      14.05        autonuma-benchmark.numa02.seconds
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time
      2200 ±  2%     -13.0%       1913 ±  3%  autonuma-benchmark.time.elapsed_time.max
   1159380 ±  2%     -12.0%    1019969 ±  3%  autonuma-benchmark.time.involuntary_context_switches
   3363550            -5.0%    3194802        autonuma-benchmark.time.minor_page_faults
    243046 ±  2%     -13.3%     210725 ±  3%  autonuma-benchmark.time.user_time
   7494239            -6.8%    6984234        proc-vmstat.numa_hit
    118829 ±  6%     +13.7%     135136 ±  6%  proc-vmstat.numa_huge_pte_updates
   6207618            -8.4%    5686795 ±  2%  proc-vmstat.numa_local
   8834573 ±  3%     +20.2%   10616944 ±  4%  proc-vmstat.numa_pages_migrated
  61094857 ±  6%     +13.6%   69409875 ±  6%  proc-vmstat.numa_pte_updates
   8602789            -9.0%    7827793 ±  2%  proc-vmstat.pgfault
   8834573 ±  3%     +20.2%   10616944 ±  4%  proc-vmstat.pgmigrate_success
    371818           -10.1%     334391 ±  2%  proc-vmstat.pgreuse
     17200 ±  3%     +20.3%      20686 ±  4%  proc-vmstat.thp_migration_success
  16401792 ±  2%     -12.7%   14322816 ±  3%  proc-vmstat.unevictable_pgs_scanned
 1.606e+08 ±  2%     -13.8%  1.385e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.avg
 1.666e+08 ±  2%     -14.0%  1.433e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.max
 1.364e+08 ±  2%     -11.7%  1.204e+08 ±  3%  sched_debug.cfs_rq:/.avg_vruntime.min
   4795327 ±  7%     -17.5%    3956991 ±  7%  sched_debug.cfs_rq:/.avg_vruntime.stddev
 1.606e+08 ±  2%     -13.8%  1.385e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.avg
 1.666e+08 ±  2%     -14.0%  1.433e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.max
 1.364e+08 ±  2%     -11.7%  1.204e+08 ±  3%  sched_debug.cfs_rq:/.min_vruntime.min
   4795327 ±  7%     -17.5%    3956991 ±  7%  sched_debug.cfs_rq:/.min_vruntime.stddev
    364.96 ±  6%     +16.6%     425.70 ±  5%  sched_debug.cfs_rq:/.util_est_enqueued.avg
   1099114           -13.0%     956021 ±  2%  sched_debug.cpu.clock.avg
   1099477           -13.0%     956344 ±  2%  sched_debug.cpu.clock.max
   1098702           -13.0%     955643 ±  2%  sched_debug.cpu.clock.min
   1080712           -13.0%     940415 ±  2%  sched_debug.cpu.clock_task.avg
   1085309           -13.1%     943557 ±  2%  sched_debug.cpu.clock_task.max
   1064613           -13.0%     925993 ±  2%  sched_debug.cpu.clock_task.min
     28890 ±  3%     -11.7%      25504 ±  3%  sched_debug.cpu.curr->pid.avg
     35200           -11.0%      31344        sched_debug.cpu.curr->pid.max
    862245 ±  3%      -8.7%     786984        sched_debug.cpu.max_idle_balance_cost.max
     74019 ±  9%     -28.2%      53158 ±  7%  sched_debug.cpu.max_idle_balance_cost.stddev
     15507           -11.9%      13667 ±  2%  sched_debug.cpu.nr_switches.avg
     57616 ±  6%     -19.0%      46642 ±  8%  sched_debug.cpu.nr_switches.max
      8460 ±  6%     -12.9%       7368 ±  5%  sched_debug.cpu.nr_switches.stddev
   1098689           -13.0%     955631 ±  2%  sched_debug.cpu_clk
   1097964           -13.0%     954907 ±  2%  sched_debug.ktime
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.avg
      0.03           +15.0%       0.03 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.max
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_migratory.stddev
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.avg
      0.03           +15.0%       0.03 ±  2%  sched_debug.rt_rq:.rt_nr_running.max
      0.00           +15.0%       0.00 ±  2%  sched_debug.rt_rq:.rt_nr_running.stddev
   1099511           -13.0%     956501 ±  2%  sched_debug.sched_clk
      1162 ±  2%     +15.2%       1339 ±  3%  perf-stat.i.MPKI
 1.656e+08            +3.6%  1.716e+08        perf-stat.i.branch-instructions
      0.95 ±  4%       +0.1       1.03        perf-stat.i.branch-miss-rate%
   1538367 ±  6%     +11.0%    1707146 ±  2%  perf-stat.i.branch-misses
 6.327e+08 ±  3%     +18.7%  7.513e+08 ±  4%  perf-stat.i.cache-misses
 8.282e+08 ±  2%     +15.2%  9.542e+08 ±  3%  perf-stat.i.cache-references
    658.12 ±  3%     -11.4%     582.98 ±  6%  perf-stat.i.cycles-between-cache-misses
 2.201e+08            +2.8%  2.263e+08        perf-stat.i.dTLB-loads
    579771            +0.9%     584915        perf-stat.i.dTLB-store-misses
 1.122e+08            +1.4%  1.138e+08        perf-stat.i.dTLB-stores
 8.278e+08            +3.1%  8.538e+08        perf-stat.i.instructions
     13.98 ±  2%     +14.3%      15.98 ±  3%  perf-stat.i.metric.M/sec
      3797            +4.3%       3958        perf-stat.i.minor-faults
    258749            +8.0%     279391 ±  2%  perf-stat.i.node-load-misses
    261169 ±  2%      +7.4%     280417 ±  5%  perf-stat.i.node-loads
     40.91 ±  3%       -3.0      37.89 ±  3%  perf-stat.i.node-store-miss-rate%
 3.841e+08 ±  6%     +27.6%  4.902e+08 ±  7%  perf-stat.i.node-stores
      3797            +4.3%       3958        perf-stat.i.page-faults
    998.24 ±  2%     +11.8%       1116 ±  2%  perf-stat.overall.MPKI
    463.91            -3.2%     448.99        perf-stat.overall.cpi
    604.23 ±  3%     -15.9%     508.08 ±  4%  perf-stat.overall.cycles-between-cache-misses
      0.00            +3.3%       0.00        perf-stat.overall.ipc
     39.20 ±  5%       -4.5      34.70 ±  6%  perf-stat.overall.node-store-miss-rate%
 1.636e+08            +3.8%  1.698e+08        perf-stat.ps.branch-instructions
   1499760 ±  6%     +11.1%    1665855 ±  2%  perf-stat.ps.branch-misses
 6.296e+08 ±  3%     +19.0%  7.489e+08 ±  4%  perf-stat.ps.cache-misses
 8.178e+08 ±  2%     +15.5%  9.447e+08 ±  3%  perf-stat.ps.cache-references
  2.18e+08            +2.9%  2.244e+08        perf-stat.ps.dTLB-loads
    578148            +0.9%     583328        perf-stat.ps.dTLB-store-misses
 1.117e+08            +1.4%  1.132e+08        perf-stat.ps.dTLB-stores
 8.192e+08            +3.3%   8.46e+08        perf-stat.ps.instructions
      3744            +4.3%       3906        perf-stat.ps.minor-faults
    255974            +8.2%     276924 ±  2%  perf-stat.ps.node-load-misses
    263796 ±  2%      +7.7%     284110 ±  5%  perf-stat.ps.node-loads
  3.82e+08 ±  6%     +27.7%  4.879e+08 ±  7%  perf-stat.ps.node-stores
      3744            +4.3%       3906        perf-stat.ps.page-faults
 1.805e+12 ±  2%     -10.1%  1.622e+12 ±  2%  perf-stat.total.instructions
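Each row reads: base value, its %stddev across runs, %change, patched value, its %stddev, metric name. The sketch below reproduces the %change column under an assumption inferred from the printed numbers rather than taken from the lkp-tests tooling: a relative change for ordinary metrics, and a plain difference in points for metrics that are already percentages or shares (mpstat.cpu.all.sys%, perf-profile.*, *-rate%). The function name lkp_change is hypothetical. Because the printed values are themselves rounded, recomputing from them can be off in the last digit (uptime.boot: 2243 -> 1957 gives -12.8% rather than the printed -12.7%).

```python
# Hypothetical helper: mimics how the "%change" column above appears to be
# derived. Inferred from the printed numbers, not copied from lkp-tests.


def lkp_change(base: float, patched: float, absolute: bool = False) -> str:
    """Return the %change cell for one metric row.

    absolute=True is for metrics that are already percentages or shares
    (e.g. mpstat.cpu.all.sys%, perf-profile.*): the column then shows the
    plain difference in points instead of a relative change.
    """
    if absolute:
        # Difference in points, one decimal, explicit sign.
        return f"{patched - base:+.1f}"
    # Relative change in percent, one decimal, explicit sign.
    return f"{(patched - base) / base * 100:+.1f}%"


if __name__ == "__main__":
    # Headline result: numa02 runtime drops from 21.17 s to 14.05 s.
    print(lkp_change(21.17, 14.05))               # -33.6%
    print(lkp_change(341.49, 327.42))             # -4.1%
    # Percentage metric: shown as a difference in points, not a ratio.
    print(lkp_change(2.92, 3.32, absolute=True))  # +0.4
```

For the *.seconds metrics lower is better, which is why the -33.6% change in autonuma-benchmark.numa02.seconds is reported as an improvement.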

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki