Re: [LKP] [bpf] fd978bf7fd: will-it-scale.per_process_ops -4.0% regression

From: Aaron Lu
Date: Thu Nov 08 2018 - 21:56:40 EST


On Fri, Nov 09, 2018 at 08:19:54AM +0800, Rong Chen wrote:
>
>
> On 11/02/2018 04:36 PM, Daniel Borkmann wrote:
> > Hi Rong,
> >
> > On 11/02/2018 03:14 AM, kernel test robot wrote:
> > > Greeting,
> > >
> > > FYI, we noticed a -4.0% regression of will-it-scale.per_process_ops due to commit:
> > >
> > >
> > > commit: fd978bf7fd312581a7ca454a991f0ffb34c4204b ("bpf: Add reference tracking to verifier")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > in testcase: will-it-scale
> > > on test machine: 80 threads Skylake with 64G memory
> > > with following parameters:
> > >
> > > nr_task: 100%
> > > mode: process
> > > test: mmap1
> > > cpufreq_governor: performance
> > Hmm, so the test cases you are running are these ones:
> >
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap1.c
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/mmap2.c
> >
> > The commit from Joe referenced above only adds a feature to the (eBPF) verifier. Looking
> > through the will-it-scale test suite, it seems neither cBPF nor eBPF is in use, and even if
> > cBPF were (e.g. via seccomp BPF), this commit would have no effect on it, since cBPF isn't
> > loaded through bpf(2). So something would have to be using eBPF here, and even then it's
> > unclear how that could remotely affect mmap() test performance by -4%. Hm, are you certain
> > it's not a false bisection? If it isn't, what else is loading eBPF on your machine in
> > parallel when you run the tests?
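
For anyone reading along: the hot loop of the mmap1 testcase linked above is
essentially the following (my paraphrase, trimmed to the part that matters) -
there is indeed no BPF anywhere in it, just a tight mmap()/munmap() cycle over
a 128MB anonymous region:

  #include <assert.h>
  #include <sys/mman.h>

  #define MEMSIZE (128UL * 1024 * 1024)

  /* Each benchmark process spins in this loop; per_process_ops is
   * roughly how many iterations complete per second, so the score is
   * pure mmap()/munmap() syscall throughput. */
  void testcase(unsigned long long *iterations)
  {
          while (1) {
                  char *c = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  assert(c != MAP_FAILED);
                  munmap(c, MEMSIZE);
                  (*iterations)++;
          }
  }
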
>
> Please accept my apologies for taking your time; it's a false bisection.

It shouldn't be a false bisection - I reran this workload with this commit
and its parent kernel on the same test box, and the performance gap is
there.

> Something strange happened; we're trying to figure out the root cause.

Yeah, it's not clear what it is :-)

Daniel & Joe,

We will see if we can find something and will let you know if we do.
Until then, I think you can ignore this report. Thanks for your time.

Regards,
Aaron

> > > test-description: Will It Scale takes a testcase and runs it from 1 through to n parallel copies to see if the testcase will scale. It builds both a process and threads based test in order to see any differences between the two.
> > > test-url: https://github.com/antonblanchard/will-it-scale
> > >
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_process_ops -3.8% regression |
> > > | test machine | 80 threads Skylake with 64G memory |
> > > | test parameters | cpufreq_governor=performance |
> > > | | mode=process |
> > > | | nr_task=100% |
> > > | | test=mmap2 |
> > > +------------------+---------------------------------------------------------------+
> > >
> > >
> > > Details are as below:
> > > -------------------------------------------------------------------------------------------------->
> > >
> > >
> > > To reproduce:
> > >
> > > git clone https://github.com/intel/lkp-tests.git
> > > cd lkp-tests
> > > bin/lkp install job.yaml # job file is attached in this email
> > > bin/lkp run job.yaml
> > >
> > > =========================================================================================
> > > compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
> > > gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap1/will-it-scale
> > >
> > > commit:
> > > 84dbf35073 ("bpf: Macrofy stack state copy")
> > > fd978bf7fd ("bpf: Add reference tracking to verifier")
> > >
> > > 84dbf3507349696b fd978bf7fd312581a7ca454a99
> > > ---------------- --------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 16811 -4.0% 16140 will-it-scale.per_process_ops
> > > 1344946 -4.0% 1291230 will-it-scale.workload
> > > 107.75 ± 38% +252.4% 379.75 ± 93% cpuidle.POLL.usage
> > > 121.70 ± 18% +18.9% 144.70 ± 4% sched_debug.cfs_rq:/.exec_clock.stddev
> > > 4933 +2.0% 5031 proc-vmstat.nr_inactive_anon
> > > 4933 +2.0% 5031 proc-vmstat.nr_zone_inactive_anon
> > > 9874 +9.0% 10765 ± 7% slabinfo.proc_inode_cache.active_objs
> > > 9874 +9.0% 10765 ± 7% slabinfo.proc_inode_cache.num_objs
> > > 12248 ± 50% +52.2% 18640 ± 2% numa-meminfo.node0.Inactive
> > > 33943 ± 8% +16.2% 39453 ± 6% numa-meminfo.node0.SReclaimable
> > > 91549 ± 9% -9.9% 82530 ± 7% numa-meminfo.node1.Slab
> > > 18091 ± 15% +29.2% 23382 ± 17% numa-vmstat.node0
> > > 3027 ± 52% +52.6% 4620 ± 3% numa-vmstat.node0.nr_inactive_anon
> > > 8485 ± 8% +16.2% 9862 ± 6% numa-vmstat.node0.nr_slab_reclaimable
> > > 3027 ± 52% +52.6% 4620 ± 3% numa-vmstat.node0.nr_zone_inactive_anon
> > > 1.4e+12 -2.5% 1.364e+12 perf-stat.branch-instructions
> > > 41.42 +0.7 42.15 perf-stat.cache-miss-rate%
> > > 2.166e+10 -2.1% 2.12e+10 perf-stat.cache-references
> > > 12.16 +2.7% 12.49 perf-stat.cpi
> > > 1.741e+12 -2.8% 1.692e+12 perf-stat.dTLB-loads
> > > 0.00 ± 3% +0.0 0.00 ± 9% perf-stat.dTLB-store-miss-rate%
> > > 5.713e+11 -3.9% 5.49e+11 perf-stat.dTLB-stores
> > > 6.103e+12 -2.6% 5.943e+12 perf-stat.instructions
> > > 0.08 -2.6% 0.08 perf-stat.ipc
> > > 1.954e+09 -1.8% 1.919e+09 perf-stat.node-load-misses
> > > 4538060 +1.4% 4602862 perf-stat.path-length
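
A note on reading the stats above, assuming the usual lkp definitions:
path-length appears to be instructions retired per benchmark operation,
i.e. perf-stat.instructions / will-it-scale.workload (6.103e12 / 1344946
~= 4.54e6, matching the 4538060 reported). Over a fixed-length run the
-4.0% throughput then decomposes as roughly +1.4% path-length times
+2.7% cpi, i.e. each operation costs about 4% more cycles.
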
> > > 49.62 -0.5 49.14 perf-profile.calltrace.cycles-pp.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 47.64 -0.5 47.17 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
> > > 47.49 -0.5 47.02 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
> > > 49.99 -0.5 49.53 perf-profile.calltrace.cycles-pp.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 49.96 -0.5 49.51 perf-profile.calltrace.cycles-pp.vm_munmap.__x64_sys_munmap.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 48.02 -0.4 47.58 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
> > > 1.41 -0.0 1.37 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
> > > 47.73 +0.4 48.11 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region
> > > 47.85 +0.4 48.25 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap
> > > 48.28 +0.4 48.68 perf-profile.calltrace.cycles-pp.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
> > > 48.23 +0.4 48.63 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.__vm_enough_memory.mmap_region.do_mmap.vm_mmap_pgoff
> > > 48.96 +0.4 49.41 perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
> > > 49.11 +0.5 49.56 perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 49.24 +0.5 49.70 perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 49.25 +0.5 49.72 perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> > > 49.62 -0.5 49.15 perf-profile.children.cycles-pp.do_munmap
> > > 49.99 -0.5 49.53 perf-profile.children.cycles-pp.__x64_sys_munmap
> > > 49.97 -0.5 49.51 perf-profile.children.cycles-pp.vm_munmap
> > > 0.51 ± 2% -0.0 0.46 perf-profile.children.cycles-pp.___might_sleep
> > > 1.16 -0.0 1.11 perf-profile.children.cycles-pp.unmap_vmas
> > > 1.15 -0.0 1.10 perf-profile.children.cycles-pp.unmap_page_range
> > > 1.41 -0.0 1.37 perf-profile.children.cycles-pp.unmap_region
> > > 0.32 ± 2% +0.0 0.34 ± 2% perf-profile.children.cycles-pp.up_write
> > > 0.32 ± 2% +0.0 0.34 perf-profile.children.cycles-pp.vm_area_alloc
> > > 0.29 +0.0 0.32 perf-profile.children.cycles-pp.kmem_cache_alloc
> > > 48.28 +0.4 48.68 perf-profile.children.cycles-pp.__vm_enough_memory
> > > 48.96 +0.4 49.41 perf-profile.children.cycles-pp.mmap_region
> > > 49.11 +0.5 49.56 perf-profile.children.cycles-pp.do_mmap
> > > 49.25 +0.5 49.71 perf-profile.children.cycles-pp.vm_mmap_pgoff
> > > 49.25 +0.5 49.72 perf-profile.children.cycles-pp.ksys_mmap_pgoff
> > > 0.47 ± 3% -0.0 0.43 perf-profile.self.cycles-pp.___might_sleep
> > > 0.32 ± 3% +0.0 0.34 ± 2% perf-profile.self.cycles-pp.up_write
> > > 0.27 +0.0 0.30 perf-profile.self.cycles-pp.kmem_cache_alloc
> > > 0.49 +0.0 0.53 perf-profile.self.cycles-pp.percpu_counter_add_batch
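
For anyone trying to make sense of the profile: in both kernels ~95% of
cycles are spent spinning on one spinlock, reached from both sides of the
benchmark loop through percpu_counter_add_batch() - __vm_enough_memory()
on mmap() and the memory-unaccounting path on munmap(), presumably both
on the vm_committed_as counter. The logic is roughly the below
(paraphrased from lib/percpu_counter.c in this kernel; not standalone
code):

  /* Small deltas accumulate in a per-cpu counter; any delta that
   * reaches 'batch' takes fbc->lock and folds into the shared count.
   * Each op here maps or unmaps 128MB = 32768 pages at once, far above
   * the batch on this box, so every mmap() and munmap() in all 80
   * processes takes the locked slow path - hence the
   * native_queued_spin_lock_slowpath contention above. */
  void percpu_counter_add_batch(struct percpu_counter *fbc, s64 amount,
                                s32 batch)
  {
          s64 count;

          preempt_disable();
          count = __this_cpu_read(*fbc->counters) + amount;
          if (count >= batch || count <= -batch) {
                  unsigned long flags;

                  raw_spin_lock_irqsave(&fbc->lock, flags);
                  fbc->count += count;
                  __this_cpu_sub(*fbc->counters, count - amount);
                  raw_spin_unlock_irqrestore(&fbc->lock, flags);
          } else {
                  this_cpu_add(*fbc->counters, amount);
          }
          preempt_enable();
  }
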
> > >
> > >
> > > will-it-scale.per_process_ops
> > > 18000 +-+-----------------------------------------------------------------+
> > > | |
> > > 17500 +-+ +.+ |
> > > |+.+++ : +.++++.+++ ++++.++++.++ |
> > > | :++. + : : : : |
> > > 17000 +-+ + + ++.++: : ++.+++ : ++.+ ++. +. |
> > > | + +.+ + + +++ +|
> > > 16500 +-+ |
> > > | O OOOO OOOO O O |
> > > 16000 +-+ O O O O O |
> > > | |
> > > O O OOO O |
> > > 15500 +O+OOO O |
> > > | |
> > > 15000 +-+-----------------------------------------------------------------+
> > > will-it-scale.workload
> > > 1.42e+06 +-+--------------------------------------------------------------+
> > > 1.4e+06 +-+ ++ |
> > > |++.++ : ++. +++.+ |
> > > 1.38e+06 +-+ : +.+++ ++ ++++.++ : |
> > > 1.36e+06 +-+ +.+++++. : : : :+ |
> > > | ++++ ++.+++++.+ + ++.+++++.++|
> > > 1.34e+06 +-+ |
> > > 1.32e+06 +-+ |
> > > 1.3e+06 +-+ O |
> > > | OO OO OO OOOOO OOO |
> > > 1.28e+06 +-+ O |
> > > 1.26e+06 +-+ O |
> > > O O O OO |
> > > 1.24e+06 +OO OO O |
> > > 1.22e+06 +-+--------------------------------------------------------------+
> > > [*] bisect-good sample
> > > [O] bisect-bad sample
> > >
> > > ***************************************************************************************************
> > > lkp-skl-2sp2: 80 threads Skylake with 64G memory
> > > =========================================================================================
> > > compiler/cpufreq_governor/kconfig/mode/nr_task/rootfs/tbox_group/test/testcase:
> > > gcc-7/performance/x86_64-rhel-7.2/process/100%/debian-x86_64-2018-04-03.cgz/lkp-skl-2sp2/mmap2/will-it-scale
> > >
> > > commit:
> > > 84dbf35073 ("bpf: Macrofy stack state copy")
> > > fd978bf7fd ("bpf: Add reference tracking to verifier")
> > >
> > > 84dbf3507349696b fd978bf7fd312581a7ca454a99
> > > ---------------- --------------------------
> > > %stddev %change %stddev
> > > \ | \
> > > 16832 -3.8% 16186 will-it-scale.per_process_ops
> > > 1346634 -3.8% 1294984 will-it-scale.workload
> > > 390809 ± 21% +51.6% 592424 ± 27% cpuidle.C1.time
> > > 6897 +2.7% 7085 proc-vmstat.nr_mapped
> > > 936.00 ± 7% +15.6% 1082 ± 5% slabinfo.Acpi-ParseExt.active_objs
> > > 936.00 ± 7% +15.6% 1082 ± 5% slabinfo.Acpi-ParseExt.num_objs
> > > 968.00 ± 9% +27.5% 1233 ± 16% slabinfo.pool_workqueue.active_objs
> > > 968.00 ± 9% +29.7% 1255 ± 16% slabinfo.pool_workqueue.num_objs
> > > 8430 -14.1% 7244 ± 2% numa-meminfo.node0.KernelStack
> > > 4283 ± 14% -22.4% 3325 ± 10% numa-meminfo.node0.PageTables
> > > 73929 ± 3% -10.6% 66061 ± 6% numa-meminfo.node0.SUnreclaim
> > > 5569 ± 2% +21.0% 6738 ± 3% numa-meminfo.node1.KernelStack
> > > 55085 ± 5% +17.5% 64739 ± 5% numa-meminfo.node1.SUnreclaim
> > > 81155 ± 6% +16.2% 94292 ± 7% numa-meminfo.node1.Slab
> > > 230.00 -100.0% 0.00 numa-vmstat.node0.nr_active_file
> > > 100.25 ± 3% -88.8% 11.25 ±173% numa-vmstat.node0.nr_inactive_file
> > > 8431 -14.1% 7245 ± 2% numa-vmstat.node0.nr_kernel_stack
> > > 1071 ± 14% -22.4% 831.25 ± 10% numa-vmstat.node0.nr_page_table_pages
> > > 18482 ± 3% -10.6% 16515 ± 6% numa-vmstat.node0.nr_slab_unreclaimable
> > > 230.00 -100.0% 0.00 numa-vmstat.node0.nr_zone_active_file
> > > 100.25 ± 3% -88.8% 11.25 ±173% numa-vmstat.node0.nr_zone_inactive_file
> > > 5569 ± 2% +21.0% 6738 ± 3% numa-vmstat.node1.nr_kernel_stack
> > > 2778 ± 3% +28.4% 3567 ± 16% numa-vmstat.node1.nr_mapped
> > > 13771 ± 5% +17.5% 16184 ± 5% numa-vmstat.node1.nr_slab_unreclaimable
> > > 1.506e+12 -2.5% 1.468e+12 perf-stat.branch-instructions
> > > 41.41 +0.8 42.20 perf-stat.cache-miss-rate%
> > > 2.165e+10 -1.7% 2.129e+10 perf-stat.cache-references
> > > 11.25 +2.8% 11.57 perf-stat.cpi
> > > 1.891e+12 -2.8% 1.838e+12 perf-stat.dTLB-loads
> > > 6.543e+11 -3.7% 6.3e+11 perf-stat.dTLB-stores
> > > 6.591e+12 -2.6% 6.419e+12 perf-stat.instructions
> > > 0.09 -2.7% 0.09 perf-stat.ipc
> > > 1.967e+09 -1.3% 1.941e+09 perf-stat.node-load-misses
> > > 4894750 +1.3% 4956596 perf-stat.path-length
> > > 40.37 ± 12% -16.2% 33.81 ± 7% sched_debug.cfs_rq:/.load_avg.stddev
> > > 0.05 ± 2% +18.7% 0.06 ± 3% sched_debug.cfs_rq:/.nr_running.stddev
> > > 6.37 ± 40% -50.2% 3.17 ± 32% sched_debug.cfs_rq:/.removed.load_avg.avg
> > > 31.64 ± 18% -28.5% 22.63 ± 16% sched_debug.cfs_rq:/.removed.load_avg.stddev
> > > 293.89 ± 40% -50.1% 146.61 ± 32% sched_debug.cfs_rq:/.removed.runnable_sum.avg
> > > 1459 ± 18% -28.3% 1045 ± 16% sched_debug.cfs_rq:/.removed.runnable_sum.stddev
> > > 2.46 ± 43% -60.9% 0.96 ± 66% sched_debug.cfs_rq:/.removed.util_avg.avg
> > > 12.42 ± 26% -46.5% 6.64 ± 59% sched_debug.cfs_rq:/.removed.util_avg.stddev
> > > 385.92 ± 6% +12.8% 435.46 ± 2% sched_debug.cpu.nr_switches.min
> > > -14.21 -31.4% -9.75 sched_debug.cpu.nr_uninterruptible.min
> > > 47.54 -0.2 47.31 perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap
> > > 47.67 -0.2 47.45 perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap
> > > 48.04 -0.2 47.86 perf-profile.calltrace.cycles-pp.percpu_counter_add_batch.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
> > > 99.36 -0.0 99.34 perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
> > > 1.47 +0.0 1.51 perf-profile.calltrace.cycles-pp.unmap_region.do_munmap.vm_munmap.__x64_sys_munmap.do_syscall_64
> > > 94.77 -0.3 94.52 perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> > > 95.04 -0.2 94.81 perf-profile.children.cycles-pp._raw_spin_lock_irqsave
> > > 95.77 -0.2 95.60 perf-profile.children.cycles-pp.percpu_counter_add_batch
> > > 49.72 -0.1 49.58 perf-profile.children.cycles-pp.do_munmap
> > > 0.53 ± 2% -0.1 0.47 perf-profile.children.cycles-pp.___might_sleep
> > > 0.30 ± 2% +0.0 0.33 perf-profile.children.cycles-pp.perf_event_mmap
> > > 0.30 ± 3% +0.0 0.33 ± 2% perf-profile.children.cycles-pp.vm_area_alloc
> > > 0.33 ± 2% +0.0 0.36 ± 2% perf-profile.children.cycles-pp.up_write
> > > 1.48 +0.0 1.51 perf-profile.children.cycles-pp.unmap_region
> > > 94.77 -0.3 94.52 perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
> > > 0.48 ± 2% -0.0 0.44 perf-profile.self.cycles-pp.___might_sleep
> > > 0.33 ± 2% +0.0 0.36 ± 2% perf-profile.self.cycles-pp.up_write
> > > 0.53 +0.0 0.57 perf-profile.self.cycles-pp.unmap_page_range
> > > 0.47 +0.0 0.52 ± 2% perf-profile.self.cycles-pp.percpu_counter_add_batch
> > >
> > >
> > >
> > >
> > >
> > > Disclaimer:
> > > Results have been estimated based on internal Intel analysis and are provided
> > > for informational purposes only. Any difference in system hardware or software
> > > design or configuration may affect actual performance.
> > >
> > >
> > > Thanks,
> > > Rong Chen
> > >
>
> _______________________________________________
> LKP mailing list
> LKP@xxxxxxxxxxxx
> https://lists.01.org/mailman/listinfo/lkp