Re: [lkp] [mremap] 5d1904204c: will-it-scale.per_thread_ops -13.1% regression

From: Aaron Lu
Date: Mon Nov 28 2016 - 03:30:12 EST


+linux-mm

On Mon, Nov 28, 2016 at 02:21:53AM +0800, kernel test robot wrote:
>
> Greeting,

Thanks for the report.

>
> FYI, we noticed a -13.1% regression of will-it-scale.per_thread_ops due to commit:

I took a look at the test. It:
1. creates an eventfd with the counter's initial value set to 0;
2. writes 1 to this eventfd, i.e. sets its counter to 1;
3. reads from it, i.e. gets the value 1 back and resets the counter to 0;
4. loops back to step 2.
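
For reference, a minimal sketch of that loop (simplified from the eventfd1
testcase in will-it-scale; the function and variable names here are mine,
not the actual test source):

	#include <assert.h>
	#include <stdint.h>
	#include <sys/eventfd.h>
	#include <unistd.h>

	static void eventfd_loop(void)
	{
		uint64_t val = 1;
		int fd, ret;

		/* step 1: create the eventfd with the counter at 0 */
		fd = eventfd(0, 0);
		assert(fd >= 0);

		for (;;) {
			/* step 2: write 1, setting the counter to 1 */
			ret = write(fd, &val, sizeof(val));
			assert(ret == sizeof(val));

			/* step 3: read returns 1 and resets the counter to 0 */
			ret = read(fd, &val, sizeof(val));
			assert(ret == sizeof(val));
			/* step 4: loop */
		}
	}

So each iteration is just a write/read pair on the same eventfd, with no
mmap/mremap activity in the loop itself.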

I don't see move_vma/move_page_tables/move_ptes involved in this test
though.

I also tried trace-cmd to see if I missed anything:
# trace-cmd record -p function --func-stack -l move_vma -l move_page_tables -l move_huge_pmd ./eventfd1_threads -t 1 -s 10

The report is:
# /usr/local/bin/trace-cmd report
CPU 0 is empty
CPU 1 is empty
CPU 2 is empty
CPU 3 is empty
CPU 4 is empty
CPU 5 is empty
CPU 6 is empty
cpus=8
eventfd1_thread-21210 [007] 2626.438884: function: move_page_tables
eventfd1_thread-21210 [007] 2626.438889: kernel_stack: <stack trace>
=> setup_arg_pages (ffffffff81282ac0)
=> load_elf_binary (ffffffff812e4503)
=> search_binary_handler (ffffffff81282268)
=> exec_binprm (ffffffff8149d6ab)
=> do_execveat_common.isra.41 (ffffffff81284882)
=> do_execve (ffffffff8128498c)
=> SyS_execve (ffffffff81284c3e)
=> do_syscall_64 (ffffffff81002a76)
=> return_from_SYSCALL_64 (ffffffff81cbd509)

i.e., there is only one call of move_page_tables in the log, and it happens
on the execve path (setup_arg_pages) at the very beginning of the run, not
during the test loop itself.

I also did the run on my Sandy Bridge desktop (4 cores / 8 threads, 12G
memory) and on a Broadwell-EP; neither of them shows this performance drop.
Perhaps the problem only occurs on some machines.

I'll continue to look into this report; hopefully I can figure out why the
bisection points to this commit while its change doesn't seem to play any
part in the test. Your comments are greatly appreciated.

Thanks,
Aaron

>
>
> commit 5d1904204c99596b50a700f092fe49d78edba400 ("mremap: fix race between mremap() and page cleanning")
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>
> in testcase: will-it-scale
> on test machine: 12 threads Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz with 6G memory
> with following parameters:
>
> test: eventfd1
> cpufreq_governor: performance
>
> test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> test-url: https://github.com/antonblanchard/will-it-scale
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone git://git.kernel.org/pub/scm/linux/kernel/git/wfg/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp run job.yaml
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase:
> gcc-6/performance/x86_64-rhel-7.2/debian-x86_64-2016-08-31.cgz/wsm/eventfd1/will-it-scale
>
> commit:
> 961b708e95 (" fixes for amdgpu, and a bunch of arm drivers.")
> 5d1904204c ("mremap: fix race between mremap() and page cleanning")
>
> 961b708e95181041 5d1904204c99596b50a700f092
> ---------------- --------------------------
> fail:runs %reproduction fail:runs
> | | |
> %stddev %change %stddev
> \ | \
> 2459656 ± 0% -13.1% 2137017 ± 1% will-it-scale.per_thread_ops
> 2865527 ± 3% +4.2% 2986100 ± 0% will-it-scale.per_process_ops
> 0.62 ± 11% -13.2% 0.54 ± 1% will-it-scale.scalability
> 893.40 ± 0% +1.3% 905.24 ± 0% will-it-scale.time.system_time
> 169.92 ± 0% -7.0% 158.09 ± 0% will-it-scale.time.user_time
> 176943 ± 6% +26.1% 223131 ± 11% cpuidle.C1E-NHM.time
> 10.00 ± 6% -10.9% 8.91 ± 4% turbostat.CPU%c6
> 30508 ± 1% +3.4% 31541 ± 0% vmstat.system.cs
> 27239 ± 0% +1.5% 27650 ± 0% vmstat.system.in
> 2.03 ± 2% -11.6% 1.80 ± 6% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64
> 4.11 ± 1% -12.0% 3.61 ± 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_swapgs
> 1.70 ± 3% -13.8% 1.46 ± 5% perf-profile.children.cycles-pp.__fget_light
> 2.03 ± 2% -11.6% 1.80 ± 6% perf-profile.children.cycles-pp.entry_SYSCALL_64
> 4.11 ± 1% -12.0% 3.61 ± 4% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_swapgs
> 12.79 ± 1% -10.0% 11.50 ± 6% perf-profile.children.cycles-pp.selinux_file_permission
> 1.70 ± 3% -13.8% 1.46 ± 5% perf-profile.self.cycles-pp.__fget_light
> 2.03 ± 2% -11.6% 1.80 ± 6% perf-profile.self.cycles-pp.entry_SYSCALL_64
> 4.11 ± 1% -12.0% 3.61 ± 4% perf-profile.self.cycles-pp.entry_SYSCALL_64_after_swapgs
> 5.85 ± 2% -12.5% 5.12 ± 5% perf-profile.self.cycles-pp.selinux_file_permission
> 1.472e+12 ± 0% -5.5% 1.392e+12 ± 0% perf-stat.branch-instructions
> 0.89 ± 0% -6.0% 0.83 ± 0% perf-stat.branch-miss-rate%
> 1.303e+10 ± 0% -11.1% 1.158e+10 ± 0% perf-stat.branch-misses
> 5.534e+08 ± 4% -6.9% 5.151e+08 ± 1% perf-stat.cache-references
> 9347877 ± 1% +3.4% 9663609 ± 0% perf-stat.context-switches
> 2.298e+12 ± 0% -5.6% 2.168e+12 ± 0% perf-stat.dTLB-loads
> 1.525e+12 ± 1% -5.4% 1.442e+12 ± 0% perf-stat.dTLB-stores
> 7.795e+12 ± 0% -5.5% 7.363e+12 ± 0% perf-stat.iTLB-loads
> 6.694e+12 ± 1% -4.5% 6.391e+12 ± 2% perf-stat.instructions
> 0.93 ± 0% -5.5% 0.88 ± 0% perf-stat.ipc
> 119024 ± 5% -11.3% 105523 ± 8% sched_debug.cfs_rq:/.exec_clock.max
> 5933459 ± 19% +24.5% 7385120 ± 3% sched_debug.cpu.nr_switches.max
> 1684848 ± 15% +20.6% 2032107 ± 3% sched_debug.cpu.nr_switches.stddev
> 5929704 ± 19% +24.5% 7382036 ± 3% sched_debug.cpu.sched_count.max
> 1684318 ± 15% +20.6% 2031701 ± 3% sched_debug.cpu.sched_count.stddev
> 2826278 ± 18% +30.4% 3684493 ± 3% sched_debug.cpu.sched_goidle.max
> 804195 ± 14% +26.2% 1014783 ± 3% sched_debug.cpu.sched_goidle.stddev
> 2969365 ± 19% +24.3% 3692180 ± 3% sched_debug.cpu.ttwu_count.max
> 843614 ± 15% +20.5% 1016263 ± 3% sched_debug.cpu.ttwu_count.stddev
> 2963657 ± 19% +24.4% 3687897 ± 3% sched_debug.cpu.ttwu_local.max
> 843104 ± 15% +20.5% 1016333 ± 3% sched_debug.cpu.ttwu_local.stddev
>
>
>
> will-it-scale.time.user_time
>
> 172 ++--------------------*--------*---*----------------------------------+
> 170 ++..*....*...*....*. *.. . ..*.... ..*...*.... |
> *. *....*. *. *...*
> 168 ++ |
> 166 ++ |
> | |
> 164 ++ |
> 162 ++ |
> 160 ++ |
> | O O |
> 158 ++ O O O |
> 156 ++ O |
> O O |
> 154 ++ O O |
> 152 ++--O-----------------------------------------------------------------+
>
>
> will-it-scale.time.system_time
>
> 912 ++--------------------------------------------------------------------+
> 910 ++ O O |
> O O O |
> 908 ++ O |
> 906 ++ O O O |
> 904 ++ O O |
> 902 ++ |
> | |
> 900 ++ |
> 898 ++ |
> 896 ++ |
> 894 ++ ..*
> *...*.... ..*.... ..*....*...*....*...*...*....*. |
> 892 ++ *. *...*...*....*...*. |
> 890 ++--------------------------------------------------------------------+
>
>
> will-it-scale.per_thread_ops
>
> 2.55e+06 ++---------------------------------------------------------------+
> 2.5e+06 ++ .*...*.. .*...*... ..*. .*.. |
> |.. . .. *...*....*. .. .*... .. . |
> 2.45e+06 *+ * .. * *...*...*
> 2.4e+06 ++ * |
> 2.35e+06 ++ |
> 2.3e+06 ++ |
> | |
> 2.25e+06 ++ |
> 2.2e+06 ++ O |
> 2.15e+06 ++ O O O O |
> 2.1e+06 ++ O O |
> O O O |
> 2.05e+06 ++ O |
> 2e+06 ++---------------------------------------------------------------+
>
> [*] bisect-good sample
> [O] bisect-bad sample
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> Thanks,
> Xiaolong