Re: [lkp-robot] [mm/cma] 2b0f904a5a: fio.read_bw_MBps -16.1% regression
From: Joonsoo Kim
Date: Tue Jan 09 2018 - 02:15:21 EST
On Sat, Jan 06, 2018 at 05:26:31PM +0800, Ye Xiaolong wrote:
> Hi,
>
> On 01/03, Joonsoo Kim wrote:
> >Hello!
> >
> >On Tue, Jan 02, 2018 at 02:35:28PM +0800, kernel test robot wrote:
> >>
> >> Greeting,
> >>
> >> FYI, we noticed a -16.1% regression of fio.read_bw_MBps due to commit:
> >>
> >>
> >> commit: 2b0f904a5a8781498417d67226fd12c5e56053ae ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
> >> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >>
> >> in testcase: fio-basic
> >> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory
> >> with following parameters:
> >>
> >> disk: 2pmem
> >> fs: ext4
> >> runtime: 200s
> >> nr_task: 50%
> >> time_based: tb
> >> rw: randread
> >> bs: 2M
> >> ioengine: mmap
> >> test_size: 200G
> >> cpufreq_governor: performance
> >>
> >> test-description: Fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user.
> >> test-url: https://github.com/axboe/fio
> >>
> >>
> >>
> >> Details are as below:
> >> -------------------------------------------------------------------------------------------------->
> >>
> >>
> >> To reproduce:
> >>
> >> git clone https://github.com/intel/lkp-tests.git
> >> cd lkp-tests
> >> bin/lkp install job.yaml # job file is attached in this email
> >> bin/lkp run job.yaml
> >>
> >> =========================================================================================
> >> bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based:
> >> 2M/gcc-7/performance/2pmem/ext4/mmap/x86_64-rhel-7.2/50%/debian-x86_64-2016-08-31.cgz/200s/randread/lkp-hsw-ep6/200G/fio-basic/tb
> >>
> >> commit:
> >> f6572f9cd2 ("mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request")
> >> 2b0f904a5a ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
> >>
> >> f6572f9cd248df2c 2b0f904a5a8781498417d67226
> >> ---------------- --------------------------
> >> %stddev %change %stddev
> >> \ | \
> >> 11451 -16.1% 9605 fio.read_bw_MBps
> >> 0.29 ± 5% +0.1 0.40 ± 3% fio.latency_1000us%
> >> 19.35 ± 5% -4.7 14.69 ± 3% fio.latency_10ms%
> >> 7.92 ± 3% +12.2 20.15 fio.latency_20ms%
> >> 0.05 ± 11% +0.0 0.09 ± 8% fio.latency_2ms%
> >> 70.22 -8.9 61.36 fio.latency_4ms%
> >> 0.29 ± 13% +0.0 0.33 ± 3% fio.latency_500us%
> >> 0.45 ± 29% +1.0 1.45 ± 4% fio.latency_50ms%
> >> 1.37 +0.1 1.44 fio.latency_750us%
> >> 9792 +31.7% 12896 fio.read_clat_90%_us
> >> 10560 +33.0% 14048 fio.read_clat_95%_us
> >> 15376 ± 10% +46.9% 22592 fio.read_clat_99%_us
> >> 4885 +19.2% 5825 fio.read_clat_mean_us
> >> 5725 -16.1% 4802 fio.read_iops
> >> 4.598e+09 -16.4% 3.845e+09 fio.time.file_system_inputs
> >> 453153 -8.4% 415215 fio.time.involuntary_context_switches
> >> 5.748e+08 -16.4% 4.806e+08 fio.time.major_page_faults
> >> 1822257 +23.7% 2254706 fio.time.maximum_resident_set_size
> >> 5089 +1.6% 5172 fio.time.system_time
> >> 514.50 -16.3% 430.48 fio.time.user_time
> >
> >System time is increased and user time is decreased. On the below, there is a clue.
> >
> >> 24569 ± 2% +9.6% 26917 ± 2% fio.time.voluntary_context_switches
> >> 54443725 -14.9% 46353339 interrupts.CAL:Function_call_interrupts
> >> 0.00 ± 79% -0.0 0.00 ± 17% mpstat.cpu.iowait%
> >> 4.45 -0.7 3.71 mpstat.cpu.usr%
> >> 1467516 +21.3% 1779543 ± 3% meminfo.Active
> >> 1276031 +23.7% 1578443 ± 4% meminfo.Active(file)
> >> 25789 ± 3% -76.7% 6013 ± 4% meminfo.CmaFree
> >> 1.296e+08 -12.6% 1.133e+08 turbostat.IRQ
> >> 41.89 -3.4% 40.47 turbostat.RAMWatt
> >> 17444 ± 2% -13.5% 15092 ± 3% turbostat.SMI
> >> 10896428 -16.4% 9111830 vmstat.io.bi
> >> 6010 -6.2% 5637 vmstat.system.cs
> >> 317438 -12.1% 278980 vmstat.system.in
> >> 1072892 ± 3% +21.5% 1303487 numa-meminfo.node0.Active
> >> 978318 +21.6% 1189809 ± 2% numa-meminfo.node0.Active(file)
> >> 222968 -25.2% 166818 numa-meminfo.node0.PageTables
> >> 47374 ± 2% +10.6% 52402 ± 7% numa-meminfo.node0.SUnreclaim
> >> 165213 +31.9% 217870 numa-meminfo.node1.PageTables
> >> 222405 +10.4% 245633 ± 2% numa-meminfo.node1.SReclaimable
> >> 102992 ± 46% -80.8% 19812 ± 38% numa-meminfo.node1.Shmem
> >> 2.475e+08 ± 2% -24.0% 1.881e+08 numa-numastat.node0.local_node
> >> 39371795 ± 14% +167.1% 1.052e+08 ± 2% numa-numastat.node0.numa_foreign
> >> 2.475e+08 ± 2% -24.0% 1.881e+08 numa-numastat.node0.numa_hit
> >> 31890417 ± 17% +40.2% 44705135 ± 8% numa-numastat.node0.numa_miss
> >> 31899482 ± 17% +40.2% 44713255 ± 8% numa-numastat.node0.other_node
> >> 2.566e+08 ± 2% -44.2% 1.433e+08 numa-numastat.node1.local_node
> >> 31890417 ± 17% +40.2% 44705135 ± 8% numa-numastat.node1.numa_foreign
> >> 2.566e+08 ± 2% -44.2% 1.433e+08 numa-numastat.node1.numa_hit
> >> 39371795 ± 14% +167.1% 1.052e+08 ± 2% numa-numastat.node1.numa_miss
> >> 39373660 ± 14% +167.1% 1.052e+08 ± 2% numa-numastat.node1.other_node
> >> 6047 ± 39% -66.5% 2028 ± 63% sched_debug.cfs_rq:/.exec_clock.min
> >> 461.37 ± 8% +64.9% 760.74 ± 20% sched_debug.cfs_rq:/.load_avg.avg
> >> 1105 ± 13% +1389.3% 16467 ± 56% sched_debug.cfs_rq:/.load_avg.max
> >> 408.99 ± 3% +495.0% 2433 ± 49% sched_debug.cfs_rq:/.load_avg.stddev
> >> 28746 ± 12% -18.7% 23366 ± 14% sched_debug.cfs_rq:/.min_vruntime.min
> >> 752426 ± 3% -12.7% 656636 ± 4% sched_debug.cpu.avg_idle.avg
> >> 144956 ± 61% -85.4% 21174 ± 26% sched_debug.cpu.avg_idle.min
> >> 245684 ± 11% +44.6% 355257 ± 2% sched_debug.cpu.avg_idle.stddev
> >> 236035 ± 15% +51.8% 358264 ± 16% sched_debug.cpu.nr_switches.max
> >> 42039 ± 22% +34.7% 56616 ± 8% sched_debug.cpu.nr_switches.stddev
> >> 3204 ± 24% -48.1% 1663 ± 30% sched_debug.cpu.sched_count.min
> >> 2132 ± 25% +38.7% 2957 ± 11% sched_debug.cpu.sched_count.stddev
> >> 90.67 ± 32% -71.8% 25.58 ± 26% sched_debug.cpu.sched_goidle.min
> >> 6467 ± 15% +22.3% 7912 ± 15% sched_debug.cpu.ttwu_count.max
> >> 1513 ± 27% -55.7% 670.92 ± 22% sched_debug.cpu.ttwu_count.min
> >> 1025 ± 20% +68.4% 1727 ± 9% sched_debug.cpu.ttwu_count.stddev
> >> 1057 ± 16% -62.9% 391.85 ± 31% sched_debug.cpu.ttwu_local.min
> >> 244876 +21.6% 297770 ± 2% numa-vmstat.node0.nr_active_file
> >> 88.00 ± 5% +19.3% 105.00 ± 5% numa-vmstat.node0.nr_isolated_file
> >> 55778 -25.1% 41765 numa-vmstat.node0.nr_page_table_pages
> >> 11843 ± 2% +10.6% 13100 ± 7% numa-vmstat.node0.nr_slab_unreclaimable
> >> 159.25 ± 42% -74.9% 40.00 ± 52% numa-vmstat.node0.nr_vmscan_immediate_reclaim
> >> 244862 +21.6% 297739 ± 2% numa-vmstat.node0.nr_zone_active_file
> >> 19364320 ± 19% +187.2% 55617595 ± 2% numa-vmstat.node0.numa_foreign
> >> 268155 ± 3% +49.6% 401089 ± 4% numa-vmstat.node0.workingset_activate
> >> 1.229e+08 -19.0% 99590617 numa-vmstat.node0.workingset_refault
> >> 6345 ± 3% -76.5% 1489 ± 3% numa-vmstat.node1.nr_free_cma
> >> 41335 +32.0% 54552 numa-vmstat.node1.nr_page_table_pages
> >> 25770 ± 46% -80.8% 4956 ± 38% numa-vmstat.node1.nr_shmem
> >> 55684 +10.4% 61475 ± 2% numa-vmstat.node1.nr_slab_reclaimable
> >> 1.618e+08 ± 8% -47.6% 84846798 ± 17% numa-vmstat.node1.numa_hit
> >> 1.617e+08 ± 8% -47.6% 84676284 ± 17% numa-vmstat.node1.numa_local
> >> 19365342 ± 19% +187.2% 55620100 ± 2% numa-vmstat.node1.numa_miss
> >> 19534837 ± 19% +185.6% 55790654 ± 2% numa-vmstat.node1.numa_other
> >> 1.296e+08 -21.0% 1.024e+08 numa-vmstat.node1.workingset_refault
> >> 1.832e+12 -7.5% 1.694e+12 perf-stat.branch-instructions
> >> 0.25 -0.0 0.23 perf-stat.branch-miss-rate%
> >> 4.666e+09 -16.0% 3.918e+09 perf-stat.branch-misses
> >> 39.88 +1.1 40.98 perf-stat.cache-miss-rate%
> >> 2.812e+10 -11.6% 2.485e+10 perf-stat.cache-misses
> >> 7.051e+10 -14.0% 6.064e+10 perf-stat.cache-references
> >> 1260521 -6.1% 1183071 perf-stat.context-switches
> >> 1.87 +9.6% 2.05 perf-stat.cpi
> >> 6707 ± 2% -5.2% 6359 perf-stat.cpu-migrations
> >> 1.04 ± 11% -0.3 0.77 ± 4% perf-stat.dTLB-load-miss-rate%
> >> 2.365e+10 ± 7% -25.9% 1.751e+10 ± 9% perf-stat.dTLB-load-misses
> >> 1.05e+12 ± 4% -9.5% 9.497e+11 ± 2% perf-stat.dTLB-stores
> >> 28.16 +2.2 30.35 ± 2% perf-stat.iTLB-load-miss-rate%
> >> 2.56e+08 -10.4% 2.295e+08 perf-stat.iTLB-loads
> >> 8.974e+12 -9.2% 8.151e+12 perf-stat.instructions
> >> 89411 -8.8% 81529 perf-stat.instructions-per-iTLB-miss
> >> 0.54 -8.8% 0.49 perf-stat.ipc
> >> 5.748e+08 -16.4% 4.806e+08 perf-stat.major-faults
> >> 52.82 +5.8 58.61 ± 2% perf-stat.node-load-miss-rate%
> >> 7.206e+09 ± 2% -18.6% 5.867e+09 ± 3% perf-stat.node-loads
> >> 17.96 ± 8% +15.7 33.69 ± 2% perf-stat.node-store-miss-rate%
> >> 2.055e+09 ± 8% +65.1% 3.393e+09 ± 4% perf-stat.node-store-misses
> >> 9.391e+09 ± 2% -28.9% 6.675e+09 perf-stat.node-stores
> >> 5.753e+08 -16.4% 4.811e+08 perf-stat.page-faults
> >> 305865 -16.3% 256108 proc-vmstat.allocstall_movable
> >> 1923 ± 14% -72.1% 537.00 ± 12% proc-vmstat.allocstall_normal
> >> 0.00 +Inf% 1577 ± 67% proc-vmstat.compact_isolated
> >> 1005 ± 4% -65.8% 344.00 ± 7% proc-vmstat.kswapd_low_wmark_hit_quickly
> >> 320062 +23.2% 394374 ± 4% proc-vmstat.nr_active_file
> >> 6411 ± 2% -76.4% 1511 ± 4% proc-vmstat.nr_free_cma
> >> 277.00 ± 12% -51.4% 134.75 ± 52% proc-vmstat.nr_vmscan_immediate_reclaim
> >> 320049 +23.2% 394353 ± 4% proc-vmstat.nr_zone_active_file
> >> 71262212 ± 15% +110.3% 1.499e+08 ± 3% proc-vmstat.numa_foreign
> >> 5.042e+08 ± 2% -34.3% 3.314e+08 proc-vmstat.numa_hit
> >> 5.041e+08 ± 2% -34.3% 3.314e+08 proc-vmstat.numa_local
> >> 71262212 ± 15% +110.3% 1.499e+08 ± 3% proc-vmstat.numa_miss
> >> 71273176 ± 15% +110.3% 1.499e+08 ± 3% proc-vmstat.numa_other
> >> 1007 ± 4% -65.6% 346.25 ± 7% proc-vmstat.pageoutrun
> >> 23070268 -16.0% 19386190 proc-vmstat.pgalloc_dma32
> >> 5.525e+08 -16.7% 4.603e+08 proc-vmstat.pgalloc_normal
> >> 5.753e+08 -16.4% 4.812e+08 proc-vmstat.pgfault
> >> 5.751e+08 -16.3% 4.813e+08 proc-vmstat.pgfree
> >> 5.748e+08 -16.4% 4.806e+08 proc-vmstat.pgmajfault
> >> 2.299e+09 -16.4% 1.923e+09 proc-vmstat.pgpgin
> >> 8.396e+08 -17.8% 6.901e+08 proc-vmstat.pgscan_direct
> >> 3.018e+08 ± 2% -13.0% 2.627e+08 proc-vmstat.pgscan_kswapd
> >> 4.1e+08 -15.1% 3.48e+08 proc-vmstat.pgsteal_direct
> >> 1.542e+08 ± 3% -20.9% 1.22e+08 ± 3% proc-vmstat.pgsteal_kswapd
> >> 23514 ± 4% -23.1% 18076 ± 16% proc-vmstat.slabs_scanned
> >> 343040 ± 2% +40.3% 481253 ± 2% proc-vmstat.workingset_activate
> >> 2.525e+08 -20.1% 2.018e+08 proc-vmstat.workingset_refault
> >> 13.64 ± 3% -1.7 11.96 ± 2% perf-profile.calltrace.cycles-pp.ext4_mpage_readpages.filemap_fault.ext4_filemap_fault.__do_fault.__handle_mm_fault
> >> 11.67 ± 3% -1.4 10.29 ± 2% perf-profile.calltrace.cycles-pp.submit_bio.ext4_mpage_readpages.filemap_fault.ext4_filemap_fault.__do_fault
> >> 11.64 ± 3% -1.4 10.25 ± 2% perf-profile.calltrace.cycles-pp.generic_make_request.submit_bio.ext4_mpage_readpages.filemap_fault.ext4_filemap_fault
> >> 11.10 ± 3% -1.3 9.82 ± 2% perf-profile.calltrace.cycles-pp.pmem_make_request.generic_make_request.submit_bio.ext4_mpage_readpages.filemap_fault
> >> 9.21 ± 3% -1.2 8.04 ± 3% perf-profile.calltrace.cycles-pp.pmem_do_bvec.pmem_make_request.generic_make_request.submit_bio.ext4_mpage_readpages
> >> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
> >> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
> >> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
> >> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
> >> 26.79 ± 4% -0.8 25.98 ± 5% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
> >> 27.98 ± 3% -0.8 27.22 ± 4% perf-profile.calltrace.cycles-pp.secondary_startup_64
> >> 5.36 ± 12% -0.6 4.76 ± 7% perf-profile.calltrace.cycles-pp.kswapd.kthread.ret_from_fork
> >> 5.36 ± 12% -0.6 4.76 ± 7% perf-profile.calltrace.cycles-pp.shrink_node.kswapd.kthread.ret_from_fork
> >> 5.30 ± 12% -0.6 4.71 ± 7% perf-profile.calltrace.cycles-pp.shrink_inactive_list.shrink_node_memcg.shrink_node.kswapd.kthread
> >> 5.35 ± 12% -0.6 4.76 ± 7% perf-profile.calltrace.cycles-pp.shrink_node_memcg.shrink_node.kswapd.kthread.ret_from_fork
> >> 5.43 ± 12% -0.5 4.88 ± 7% perf-profile.calltrace.cycles-pp.ret_from_fork
> >> 5.43 ± 12% -0.5 4.88 ± 7% perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
> >> 11.04 ± 2% -0.2 10.82 ± 2% perf-profile.calltrace.cycles-pp.shrink_page_list.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages
> >> 62.44 ± 2% +1.9 64.38 perf-profile.calltrace.cycles-pp.page_fault
> >> 62.38 ± 2% +2.0 64.33 perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
> >> 62.38 ± 2% +2.0 64.34 perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
> >> 61.52 ± 2% +2.1 63.58 perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
> >> 61.34 ± 2% +2.1 63.44 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
> >> 30.18 ± 3% +2.3 32.45 ± 2% perf-profile.calltrace.cycles-pp.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages
> >> 7.98 ± 3% +2.3 10.33 ± 2% perf-profile.calltrace.cycles-pp.add_to_page_cache_lru.filemap_fault.ext4_filemap_fault.__do_fault.__handle_mm_fault
> >> 30.48 ± 3% +2.4 32.83 ± 2% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.filemap_fault.ext4_filemap_fault
> >> 30.46 ± 3% +2.4 32.81 ± 2% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.filemap_fault
> >> 30.46 ± 3% +2.4 32.81 ± 2% perf-profile.calltrace.cycles-pp.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask
> >> 30.37 ± 3% +2.4 32.75 ± 2% perf-profile.calltrace.cycles-pp.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath
> >> 5.58 ± 4% +2.5 8.08 ± 2% perf-profile.calltrace.cycles-pp.__lru_cache_add.add_to_page_cache_lru.filemap_fault.ext4_filemap_fault.__do_fault
> >> 32.88 ± 3% +2.5 35.38 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.filemap_fault.ext4_filemap_fault.__do_fault.__handle_mm_fault
> >> 5.51 ± 4% +2.5 8.02 ± 2% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.__lru_cache_add.add_to_page_cache_lru.filemap_fault.ext4_filemap_fault
> >> 4.24 ± 4% +2.5 6.76 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.add_to_page_cache_lru.filemap_fault
> >> 4.18 ± 4% +2.5 6.70 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.add_to_page_cache_lru
> >> 18.64 ± 3% +2.5 21.16 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.shrink_inactive_list.shrink_node_memcg.shrink_node
> >> 31.65 ± 3% +2.7 34.31 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.filemap_fault.ext4_filemap_fault.__do_fault
> >> 17.21 ± 3% +2.7 19.93 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages
> >
> >It looks like there is more lru lock contention. It would be caused by
> >using a movable zone for the CMA memory by this patch. In this case,
> >reclaim for normal memory skips the lru page on the movable zone so needs
> >more time to find enough reclaim target pages. It would increase lru lock
> >holding time and then cause contention.
> >
> >Could you give me another stat 'pgskip_XXX' in /proc/vmstat to confirm
> >my theory?
>
> Attached is the /proc/vmstat sample file during the test, sample interval is 1s.
Thanks!
pgskip_XXX is low so my theory would be wrong. The other theory is
that numa miss is the reason of the regression. Could you test the
same test on the system without numa? I cannot test it since I
don't have pmem.
Thanks.