Re: [LKP] [bpf] fd978bf7fd: will-it-scale.per_process_ops -4.0% regression

From: Rong Chen
Date: Thu Nov 08 2018 - 19:20:11 EST
On 11/02/2018 04:36 PM, Daniel Borkmann wrote:
Hi Rong,

On 11/02/2018 03:14 AM, kernel test robot wrote:
Greeting,

FYI, we noticed a -4.0% regression of will-it-scale.per_process_ops due to commit:


commit: fd978bf7fd312581a7ca454a991f0ffb34c4204b ("bpf: Add reference tracking to verifier")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

in testcase: will-it-scale
on test machine: 80 threads Skylake with 64G memory
with following parameters:

nr_task: 100%
mode: process
test: mmap1
cpufreq_governor: performance

Hmm, so the test cases you are running are these ones:

https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap1.c
https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap2.c
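
For readers not familiar with these tests: each worker is essentially a tight mmap()/munmap()
loop over a large anonymous mapping that bumps an iteration counter sampled by the harness.
A rough, self-contained sketch of that loop (the size, flags and harness glue here are
assumptions for illustration; the URLs above have the real sources):

/* Rough sketch of what the mmap1/mmap2 workers boil down to: map a large
 * anonymous region, unmap it, bump an iteration counter.  Size, flags and
 * the surrounding harness are assumptions; see the URLs above for the
 * actual test sources. */
#include <sys/mman.h>
#include <assert.h>

#define MEMSIZE (128UL * 1024 * 1024)   /* assumed mapping size */

static void testcase(unsigned long long *iterations)
{
	for (;;) {
		char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		assert(c != MAP_FAILED);
		munmap(c, MEMSIZE);
		(*iterations)++;   /* sampled periodically by the real harness */
	}
}

int main(void)
{
	unsigned long long iterations = 0;
	testcase(&iterations);   /* will-it-scale runs 1..n copies of this */
	return 0;
}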

The commit from Joe referenced above only adds a feature to the (eBPF) verifier. Looking
through the will-it-scale test suite, it looks like neither cBPF nor eBPF is in use, and even
if the former were (e.g. via seccomp BPF), it would still be unaffected since seccomp filters
are not loaded through bpf(2). That means something would have to be using eBPF here, but
even then it's unclear right now how that could even remotely affect mmap() test performance
by -4%. Hm, are you certain it's not a false bisection? If so, what else is loading eBPF on
your machine in parallel while you run the tests?
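
(For reference, a classic seccomp filter is attached via prctl(2)/seccomp(2) rather than
bpf(2), so it never passes through the eBPF verifier that this commit touches. A minimal
allow-everything install, purely as an illustration of that path:)

/* Minimal illustration: a classic (cBPF) seccomp filter is installed via
 * prctl()/seccomp(), not via the bpf() syscall, so the eBPF verifier the
 * commit modifies is never involved.  Allow-all filter, for illustration. */
#include <linux/filter.h>   /* struct sock_filter, struct sock_fprog, BPF_STMT */
#include <linux/seccomp.h>  /* SECCOMP_MODE_FILTER, SECCOMP_RET_ALLOW */
#include <sys/prctl.h>      /* prctl(), PR_SET_* */
#include <stdio.h>

int main(void)
{
	/* One-instruction cBPF program: "return ALLOW" for every syscall. */
	struct sock_filter insns[] = {
		BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
	};
	struct sock_fprog prog = {
		.len = sizeof(insns) / sizeof(insns[0]),
		.filter = insns,
	};

	/* Required so an unprivileged task may install a filter. */
	if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
		perror("PR_SET_NO_NEW_PRIVS");

	/* The filter is attached here, through prctl(), not bpf(2). */
	if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog))
		perror("PR_SET_SECCOMP");

	return 0;
}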

Please accept my apologies for taking your time; it's a false bisection.
Something strange happened, and we're trying to figure out the root cause.

Best Regards,
Rong Chen


Thanks,
Daniel

test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both process- and thread-based variants of the test in order to see any differences between the two.
test-url: https://github.com/antonblanchard/will-it-scale

In addition to that, the commit also has significant impact on the following tests:

+------------------+---------------------------------------------------------------+
| testcase: change | will-it-scale: will-it-scale.per_process_ops -3.8% regression |
| test machine | 80 threads Skylake with 64G memory |
| test parameters | cpufreq_governor=performance |
| | mode=process |
| | nr_task=100% |
| | test=mmap2 |
+------------------+---------------------------------------------------------------+


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap1/will-it-scale

commit:
84dbf35073 ("bpf: Macrofy stack state copy")
fd978bf7fd ("bpf: Add reference tracking to verifier")

84dbf3507349696b fd978bf7fd312581a7ca454a99
---------------- --------------------------
%stddev %change %stddev
\ | \
16811 -4.0% 16140 will-it-scale.per_process_ops
1344946 -4.0% 1291230 will-it-scale.workload
107.75 ± 38% +252.4% 379.75 ± 93% cpuidle.POLL.usage
121.70 ± 18% +18.9% 144.70 ± 4% sched_debug.cfs_rq:/.exec_clock.stddev
4933 +2.0% 5031 proc-vmstat.nr_inactive_anon
4933 +2.0% 5031 proc-vmstat.nr_zone_inactive_anon
9874 +9.0% 10765 ± 7% slabinfo.proc_inode_cache.active_objs
9874 +9.0% 10765 ± 7% slabinfo.proc_inode_cache.num_objs
12248 ± 50% +52.2% 18640 ± 2% numa-meminfo.node0.Inactive
33943 ± 8% +16.2% 39453 ± 6% numa-meminfo.node0.SReclaimable
91549 ± 9% -9.9% 82530 ± 7% numa-meminfo.node1.Slab
18091 ± 15% +29.2% 23382 ± 17% numa-vmstat.node0
3027 ± 52% +52.6% 4620 ± 3% numa-vmstat.node0.nr_inactive_anon
8485 ± 8% +16.2% 9862 ± 6% numa-vmstat.node0.nr_slab_reclaimable
3027 ± 52% +52.6% 4620 ± 3% numa-vmstat.node0.nr_zone_inactive_anon
1.4e+12 -2.5% 1.364e+12 perf-stat.branch-instructions
41.42 +0.7 42.15 perf-stat.cache-miss-rate%
2.166e+10 -2.1% 2.12e+10 perf-stat.cache-references
12.16 +2.7% 12.49 perf-stat.cpi
1.741e+12 -2.8% 1.692e+12 perf-stat.dTLB-loads
0.00 ± 3% +0.0 0.00 ± 9% perf-stat.dTLB-store-miss-rate%
5.713e+11 -3.9% 5.49e+11 perf-stat.dTLB-stores
6.103e+12 -2.6% 5.943e+12 perf-stat.instructions
0.08 -2.6% 0.08 perf-stat.ipc
1.954e+09 -1.8% 1.919e+09 perf-stat.node-load-misses
4538060 +1.4% 4602862 perf-stat.path-length
49.62 -0.5 49.14 perf-profile.calltrace.cycles-pp.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
47.64 -0.5 47.17 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
47.49 -0.5 47.02 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
49.99 -0.5 49.53 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.96 -0.5 49.51 perf-profile.calltrace.cycles-pp.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
48.02 -0.4 47.58 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
1.41 -0.0 1.37 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
47.73 +0.4 48.11 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region
47.85 +0.4 48.25 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap
48.28 +0.4 48.68 perf-profile.calltrace.cycles-pp.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
48.23 +0.4 48.63 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff
48.96 +0.4 49.41 perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
49.11 +0.5 49.56 perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.24 +0.5 49.70 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.25 +0.5 49.72 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
49.62 -0.5 49.15 perf-profile.children.cycles-pp.do_munmap
49.99 -0.5 49.53 perf-profile.children.cycles-pp.__x64_sys_munmap
49.97 -0.5 49.51 perf-profile.children.cycles-pp.vm_munmap
0.51 ± 2% -0.0 0.46 perf-profile.children.cycles-pp.___might_sleep
1.16 -0.0 1.11 perf-profile.children.cycles-pp.unmap_vmas
1.15 -0.0 1.10 perf-profile.children.cycles-pp.unmap_page_range
1.41 -0.0 1.37 perf-profile.children.cycles-pp.unmap_region
0.32 ± 2% +0.0 0.34 ± 2% perf-profile.children.cycles-pp.up_write
0.32 ± 2% +0.0 0.34 perf-profile.children.cycles-pp.vm_area_alloc
0.29 +0.0 0.32 perf-profile.children.cycles-pp.kmem_cache_alloc
48.28 +0.4 48.68 perf-profile.children.cycles-pp.__vm_enough_memory
48.96 +0.4 49.41 perf-profile.children.cycles-pp.mmap_region
49.11 +0.5 49.56 perf-profile.children.cycles-pp.do_mmap
49.25 +0.5 49.71 perf-profile.children.cycles-pp.vm_mmap_pgoff
49.25 +0.5 49.72 perf-profile.children.cycles-pp.ksys_mmap_pgoff
0.47 ± 3% -0.0 0.43 perf-profile.self.cycles-pp.___might_sleep
0.32 ± 3% +0.0 0.34 ± 2% perf-profile.self.cycles-pp.up_write
0.27 +0.0 0.30 perf-profile.self.cycles-pp.kmem_cache_alloc
0.49 +0.0 0.53 perf-profile.self.cycles-pp.percpu_counter_add_batch
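
For context on the dominant symbols in the profile above: percpu_counter_add_batch() (called
here from __vm_enough_memory() and do_munmap()) keeps a per-CPU delta and only folds it into
the shared count under a spinlock once the delta crosses a batch threshold, which is the
native_queued_spin_lock_slowpath contention visible above. A simplified user-space sketch of
that batching pattern, not the kernel implementation:

/* Simplified user-space illustration of the batched per-CPU counter
 * pattern behind percpu_counter_add_batch(): updates stay in a
 * thread-local delta and the shared lock is only taken when the delta
 * exceeds the batch.  A sketch of the idea, not the kernel code. */
#include <pthread.h>
#include <stdio.h>

#define BATCH 32

static pthread_spinlock_t lock;
static long shared_count;              /* protected by lock */
static __thread long local_delta;      /* per-thread, lock-free fast path */

static void counter_add_batch(long amount)
{
	local_delta += amount;
	if (local_delta >= BATCH || local_delta <= -BATCH) {
		pthread_spin_lock(&lock);  /* the slow path contended in the profile */
		shared_count += local_delta;
		local_delta = 0;
		pthread_spin_unlock(&lock);
	}
}

int main(void)
{
	pthread_spin_init(&lock, PTHREAD_PROCESS_PRIVATE);
	for (int i = 0; i < 1000; i++)
		counter_add_batch(1);
	printf("shared=%ld local=%ld\n", shared_count, local_delta);
	return 0;
}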


will-it-scale.per_process_ops
18000 +-+-----------------------------------------------------------------+
| |
17500 +-+ +.+ |
|+.+++ : +.++++.+++ ++++.++++.++ |
| :++. + : : : : |
17000 +-+ + + ++.++: : ++.+++ : ++.+ ++. +. |
| + +.+ + + +++ +|
16500 +-+ |
| O OOOO OOOO O O |
16000 +-+ O O O O O |
| |
O O OOO O |
15500 +O+OOO O |
| |
15000 +-+-----------------------------------------------------------------+

will-it-scale.workload
1.42e+06 +-+--------------------------------------------------------------+
1.4e+06 +-+ ++ |
|++.++ : ++. +++.+ |
1.38e+06 +-+ : +.+++ ++ ++++.++ : |
1.36e+06 +-+ +.+++++. : : : :+ |
| ++++ ++.+++++.+ + ++.+++++.++|
1.34e+06 +-+ |
1.32e+06 +-+ |
1.3e+06 +-+ O |
| OO OO OO OOOOO OOO |
1.28e+06 +-+ O |
1.26e+06 +-+ O |
O O O OO |
1.24e+06 +OO OO O |
1.22e+06 +-+--------------------------------------------------------------+
[*] bisect-good sample
[O] bisect-bad sample

***************************************************************************************************
lkp-skl-2sp2: 80 threads Skylake with 64G memory
=========================================================================================
compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap2/will-it-scale

commit:
84dbf35073 ("bpf: Macrofy stack state copy")
fd978bf7fd ("bpf: Add reference tracking to verifier")

84dbf3507349696b fd978bf7fd312581a7ca454a99
---------------- --------------------------
%stddev %change %stddev
\ | \
16832 -3.8% 16186 will-it-scale.per_process_ops
1346634 -3.8% 1294984 will-it-scale.workload
390809 ± 21% +51.6% 592424 ± 27% cpuidle.C1.time
6897 +2.7% 7085 proc-vmstat.nr_mapped
936.00 ± 7% +15.6% 1082 ± 5% slabinfo.Acpi-ParseExt.active_objs
936.00 ± 7% +15.6% 1082 ± 5% slabinfo.Acpi-ParseExt.num_objs
968.00 ± 9% +27.5% 1233 ± 16% slabinfo.pool_workqueue.active_objs
968.00 ± 9% +29.7% 1255 ± 16% slabinfo.pool_workqueue.num_objs
8430 -14.1% 7244 ± 2% numa-meminfo.node0.KernelStack
4283 ± 14% -22.4% 3325 ± 10% numa-meminfo.node0.PageTables
73929 ± 3% -10.6% 66061 ± 6% numa-meminfo.node0.SUnreclaim
5569 ± 2% +21.0% 6738 ± 3% numa-meminfo.node1.KernelStack
55085 ± 5% +17.5% 64739 ± 5% numa-meminfo.node1.SUnreclaim
81155 ± 6% +16.2% 94292 ± 7% numa-meminfo.node1.Slab
230.00 -100.0% 0.00 numa-vmstat.node0.nr_active_file
100.25 ± 3% -88.8% 11.25 ±173% numa-vmstat.node0.nr_inactive_file
8431 -14.1% 7245 ± 2% numa-vmstat.node0.nr_kernel_stack
1071 ± 14% -22.4% 831.25 ± 10% numa-vmstat.node0.nr_page_table_pages
18482 ± 3% -10.6% 16515 ± 6% numa-vmstat.node0.nr_slab_unreclaimable
230.00 -100.0% 0.00 numa-vmstat.node0.nr_zone_active_file
100.25 ± 3% -88.8% 11.25 ±173% numa-vmstat.node0.nr_zone_inactive_file
5569 ± 2% +21.0% 6738 ± 3% numa-vmstat.node1.nr_kernel_stack
2778 ± 3% +28.4% 3567 ± 16% numa-vmstat.node1.nr_mapped
13771 ± 5% +17.5% 16184 ± 5% numa-vmstat.node1.nr_slab_unreclaimable
1.506e+12 -2.5% 1.468e+12 perf-stat.branch-instructions
41.41 +0.8 42.20 perf-stat.cache-miss-rate%
2.165e+10 -1.7% 2.129e+10 perf-stat.cache-references
11.25 +2.8% 11.57 perf-stat.cpi
1.891e+12 -2.8% 1.838e+12 perf-stat.dTLB-loads
6.543e+11 -3.7% 6.3e+11 perf-stat.dTLB-stores
6.591e+12 -2.6% 6.419e+12 perf-stat.instructions
0.09 -2.7% 0.09 perf-stat.ipc
1.967e+09 -1.3% 1.941e+09 perf-stat.node-load-misses
4894750 +1.3% 4956596 perf-stat.path-length
40.37 ± 12% -16.2% 33.81 ± 7% sched_debug.cfs_rq:/.load_avg.stddev
0.05 ± 2% +18.7% 0.06 ± 3% sched_debug.cfs_rq:/.nr_running.stddev
6.37 ± 40% -50.2% 3.17 ± 32% sched_debug.cfs_rq:/.removed.load_avg.avg
31.64 ± 18% -28.5% 22.63 ± 16% sched_debug.cfs_rq:/.removed.load_avg.stddev
293.89 ± 40% -50.1% 146.61 ± 32% sched_debug.cfs_rq:/.removed.runnable_sum.avg
1459 ± 18% -28.3% 1045 ± 16% sched_debug.cfs_rq:/.removed.runnable_sum.stddev
2.46 ± 43% -60.9% 0.96 ± 66% sched_debug.cfs_rq:/.removed.util_avg.avg
12.42 ± 26% -46.5% 6.64 ± 59% sched_debug.cfs_rq:/.removed.util_avg.stddev
385.92 ± 6% +12.8% 435.46 ± 2% sched_debug.cpu.nr_switches.min
-14.21 -31.4% -9.75 sched_debug.cpu.nr_uninterruptible.min
47.54 -0.2 47.31 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
47.67 -0.2 47.45 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
48.04 -0.2 47.86 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
99.36 -0.0 99.34 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
1.47 +0.0 1.51 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
94.77 -0.3 94.52 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
95.04 -0.2 94.81 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
95.77 -0.2 95.60 perf-profile.children.cycles-pp.percpu_counter_add_batch
49.72 -0.1 49.58 perf-profile.children.cycles-pp.do_munmap
0.53 ± 2% -0.1 0.47 perf-profile.children.cycles-pp.___might_sleep
0.30 ± 2% +0.0 0.33 perf-profile.children.cycles-pp.perf_event_mmap
0.30 ± 3% +0.0 0.33 ± 2% perf-profile.children.cycles-pp.vm_area_alloc
0.33 ± 2% +0.0 0.36 ± 2% perf-profile.children.cycles-pp.up_write
1.48 +0.0 1.51 perf-profile.children.cycles-pp.unmap_region
94.77 -0.3 94.52 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
0.48 ± 2% -0.0 0.44 perf-profile.self.cycles-pp.___might_sleep
0.33 ± 2% +0.0 0.36 ± 2% perf-profile.self.cycles-pp.up_write
0.53 +0.0 0.57 perf-profile.self.cycles-pp.unmap_page_range
0.47 +0.0 0.52 ± 2% perf-profile.self.cycles-pp.percpu_counter_add_batch





Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Rong Chen