Re: [lkp-robot] [mm/cma] 2b0f904a5a: fio.read_bw_MBps -16.1% regression

From: Joonsoo Kim
Date: Tue Jan 02 2018 - 21:04:39 EST


Hello!

On Tue, Jan 02, 2018 at 02:35:28PM +0800, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed a -16.1% regression of fio.read_bw_MBps due to commit:
>
>
> commit: 2b0f904a5a8781498417d67226fd12c5e56053ae ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: fio-basic
> on test machine: 56 threads Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz with 256G memory
> with following parameters:
>
> disk: 2pmem
> fs: ext4
> runtime: 200s
> nr_task: 50%
> time_based: tb
> rw: randread
> bs: 2M
> ioengine: mmap
> test_size: 200G
> cpufreq_governor: performance
>
> test-description: Fio is a tool that will spawn a number of threads or processes doing a particular type of I/O action as specified by the user.
> test-url: https://github.com/axboe/fio
>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp run job.yaml
>
> =========================================================================================
> bs/compiler/cpufreq_governor/disk/fs/ioengine/kconfig/nr_task/rootfs/runtime/rw/tbox_group/test_size/testcase/time_based:
> 2M/gcc-7/performance/2pmem/ext4/mmap/x86_64-rhel-7.2/50%/debian-x86_64-2016-08-31.cgz/200s/randread/lkp-hsw-ep6/200G/fio-basic/tb
>
> commit:
> f6572f9cd2 ("mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request")
> 2b0f904a5a ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
>
> f6572f9cd248df2c 2b0f904a5a8781498417d67226
> ---------------- --------------------------
> %stddev %change %stddev
> \ | \
> 11451 -16.1% 9605 fio.read_bw_MBps
> 0.29 ± 5% +0.1 0.40 ± 3% fio.latency_1000us%
> 19.35 ± 5% -4.7 14.69 ± 3% fio.latency_10ms%
> 7.92 ± 3% +12.2 20.15 fio.latency_20ms%
> 0.05 ± 11% +0.0 0.09 ± 8% fio.latency_2ms%
> 70.22 -8.9 61.36 fio.latency_4ms%
> 0.29 ± 13% +0.0 0.33 ± 3% fio.latency_500us%
> 0.45 ± 29% +1.0 1.45 ± 4% fio.latency_50ms%
> 1.37 +0.1 1.44 fio.latency_750us%
> 9792 +31.7% 12896 fio.read_clat_90%_us
> 10560 +33.0% 14048 fio.read_clat_95%_us
> 15376 ± 10% +46.9% 22592 fio.read_clat_99%_us
> 4885 +19.2% 5825 fio.read_clat_mean_us
> 5725 -16.1% 4802 fio.read_iops
> 4.598e+09 -16.4% 3.845e+09 fio.time.file_system_inputs
> 453153 -8.4% 415215 fio.time.involuntary_context_switches
> 5.748e+08 -16.4% 4.806e+08 fio.time.major_page_faults
> 1822257 +23.7% 2254706 fio.time.maximum_resident_set_size
> 5089 +1.6% 5172 fio.time.system_time
> 514.50 -16.3% 430.48 fio.time.user_time

System time is increased and user time is decreased. On the below, there is a clue.

> 24569 ± 2% +9.6% 26917 ± 2% fio.time.voluntary_context_switches
> 54443725 -14.9% 46353339 interrupts.CAL:Function_call_interrupts
> 0.00 ± 79% -0.0 0.00 ± 17% mpstat.cpu.iowait%
> 4.45 -0.7 3.71 mpstat.cpu.usr%
> 1467516 +21.3% 1779543 ± 3% meminfo.Active
> 1276031 +23.7% 1578443 ± 4% meminfo.Active(file)
> 25789 ± 3% -76.7% 6013 ± 4% meminfo.CmaFree
> 1.296e+08 -12.6% 1.133e+08 turbostat.IRQ
> 41.89 -3.4% 40.47 turbostat.RAMWatt
> 17444 ± 2% -13.5% 15092 ± 3% turbostat.SMI
> 10896428 -16.4% 9111830 vmstat.io.bi
> 6010 -6.2% 5637 vmstat.system.cs
> 317438 -12.1% 278980 vmstat.system.in
> 1072892 ± 3% +21.5% 1303487 numa-meminfo.node0.Active
> 978318 +21.6% 1189809 ± 2% numa-meminfo.node0.Active(file)
> 222968 -25.2% 166818 numa-meminfo.node0.PageTables
> 47374 ± 2% +10.6% 52402 ± 7% numa-meminfo.node0.SUnreclaim
> 165213 +31.9% 217870 numa-meminfo.node1.PageTables
> 222405 +10.4% 245633 ± 2% numa-meminfo.node1.SReclaimable
> 102992 ± 46% -80.8% 19812 ± 38% numa-meminfo.node1.Shmem
> 2.475e+08 ± 2% -24.0% 1.881e+08 numa-numastat.node0.local_node
> 39371795 ± 14% +167.1% 1.052e+08 ± 2% numa-numastat.node0.numa_foreign
> 2.475e+08 ± 2% -24.0% 1.881e+08 numa-numastat.node0.numa_hit
> 31890417 ± 17% +40.2% 44705135 ± 8% numa-numastat.node0.numa_miss
> 31899482 ± 17% +40.2% 44713255 ± 8% numa-numastat.node0.other_node
> 2.566e+08 ± 2% -44.2% 1.433e+08 numa-numastat.node1.local_node
> 31890417 ± 17% +40.2% 44705135 ± 8% numa-numastat.node1.numa_foreign
> 2.566e+08 ± 2% -44.2% 1.433e+08 numa-numastat.node1.numa_hit
> 39371795 ± 14% +167.1% 1.052e+08 ± 2% numa-numastat.node1.numa_miss
> 39373660 ± 14% +167.1% 1.052e+08 ± 2% numa-numastat.node1.other_node
> 6047 ± 39% -66.5% 2028 ± 63% sched_debug.cfs_rq:/.exec_clock.min
> 461.37 ± 8% +64.9% 760.74 ± 20% sched_debug.cfs_rq:/.load_avg.avg
> 1105 ± 13% +1389.3% 16467 ± 56% sched_debug.cfs_rq:/.load_avg.max
> 408.99 ± 3% +495.0% 2433 ± 49% sched_debug.cfs_rq:/.load_avg.stddev
> 28746 ± 12% -18.7% 23366 ± 14% sched_debug.cfs_rq:/.min_vruntime.min
> 752426 ± 3% -12.7% 656636 ± 4% sched_debug.cpu.avg_idle.avg
> 144956 ± 61% -85.4% 21174 ± 26% sched_debug.cpu.avg_idle.min
> 245684 ± 11% +44.6% 355257 ± 2% sched_debug.cpu.avg_idle.stddev
> 236035 ± 15% +51.8% 358264 ± 16% sched_debug.cpu.nr_switches.max
> 42039 ± 22% +34.7% 56616 ± 8% sched_debug.cpu.nr_switches.stddev
> 3204 ± 24% -48.1% 1663 ± 30% sched_debug.cpu.sched_count.min
> 2132 ± 25% +38.7% 2957 ± 11% sched_debug.cpu.sched_count.stddev
> 90.67 ± 32% -71.8% 25.58 ± 26% sched_debug.cpu.sched_goidle.min
> 6467 ± 15% +22.3% 7912 ± 15% sched_debug.cpu.ttwu_count.max
> 1513 ± 27% -55.7% 670.92 ± 22% sched_debug.cpu.ttwu_count.min
> 1025 ± 20% +68.4% 1727 ± 9% sched_debug.cpu.ttwu_count.stddev
> 1057 ± 16% -62.9% 391.85 ± 31% sched_debug.cpu.ttwu_local.min
> 244876 +21.6% 297770 ± 2% numa-vmstat.node0.nr_active_file
> 88.00 ± 5% +19.3% 105.00 ± 5% numa-vmstat.node0.nr_isolated_file
> 55778 -25.1% 41765 numa-vmstat.node0.nr_page_table_pages
> 11843 ± 2% +10.6% 13100 ± 7% numa-vmstat.node0.nr_slab_unreclaimable
> 159.25 ± 42% -74.9% 40.00 ± 52% numa-vmstat.node0.nr_vmscan_immediate_reclaim
> 244862 +21.6% 297739 ± 2% numa-vmstat.node0.nr_zone_active_file
> 19364320 ± 19% +187.2% 55617595 ± 2% numa-vmstat.node0.numa_foreign
> 268155 ± 3% +49.6% 401089 ± 4% numa-vmstat.node0.workingset_activate
> 1.229e+08 -19.0% 99590617 numa-vmstat.node0.workingset_refault
> 6345 ± 3% -76.5% 1489 ± 3% numa-vmstat.node1.nr_free_cma
> 41335 +32.0% 54552 numa-vmstat.node1.nr_page_table_pages
> 25770 ± 46% -80.8% 4956 ± 38% numa-vmstat.node1.nr_shmem
> 55684 +10.4% 61475 ± 2% numa-vmstat.node1.nr_slab_reclaimable
> 1.618e+08 ± 8% -47.6% 84846798 ± 17% numa-vmstat.node1.numa_hit
> 1.617e+08 ± 8% -47.6% 84676284 ± 17% numa-vmstat.node1.numa_local
> 19365342 ± 19% +187.2% 55620100 ± 2% numa-vmstat.node1.numa_miss
> 19534837 ± 19% +185.6% 55790654 ± 2% numa-vmstat.node1.numa_other
> 1.296e+08 -21.0% 1.024e+08 numa-vmstat.node1.workingset_refault
> 1.832e+12 -7.5% 1.694e+12 perf-stat.branch-instructions
> 0.25 -0.0 0.23 perf-stat.branch-miss-rate%
> 4.666e+09 -16.0% 3.918e+09 perf-stat.branch-misses
> 39.88 +1.1 40.98 perf-stat.cache-miss-rate%
> 2.812e+10 -11.6% 2.485e+10 perf-stat.cache-misses
> 7.051e+10 -14.0% 6.064e+10 perf-stat.cache-references
> 1260521 -6.1% 1183071 perf-stat.context-switches
> 1.87 +9.6% 2.05 perf-stat.cpi
> 6707 ± 2% -5.2% 6359 perf-stat.cpu-migrations
> 1.04 ± 11% -0.3 0.77 ± 4% perf-stat.dTLB-load-miss-rate%
> 2.365e+10 ± 7% -25.9% 1.751e+10 ± 9% perf-stat.dTLB-load-misses
> 1.05e+12 ± 4% -9.5% 9.497e+11 ± 2% perf-stat.dTLB-stores
> 28.16 +2.2 30.35 ± 2% perf-stat.iTLB-load-miss-rate%
> 2.56e+08 -10.4% 2.295e+08 perf-stat.iTLB-loads
> 8.974e+12 -9.2% 8.151e+12 perf-stat.instructions
> 89411 -8.8% 81529 perf-stat.instructions-per-iTLB-miss
> 0.54 -8.8% 0.49 perf-stat.ipc
> 5.748e+08 -16.4% 4.806e+08 perf-stat.major-faults
> 52.82 +5.8 58.61 ± 2% perf-stat.node-load-miss-rate%
> 7.206e+09 ± 2% -18.6% 5.867e+09 ± 3% perf-stat.node-loads
> 17.96 ± 8% +15.7 33.69 ± 2% perf-stat.node-store-miss-rate%
> 2.055e+09 ± 8% +65.1% 3.393e+09 ± 4% perf-stat.node-store-misses
> 9.391e+09 ± 2% -28.9% 6.675e+09 perf-stat.node-stores
> 5.753e+08 -16.4% 4.811e+08 perf-stat.page-faults
> 305865 -16.3% 256108 proc-vmstat.allocstall_movable
> 1923 ± 14% -72.1% 537.00 ± 12% proc-vmstat.allocstall_normal
> 0.00 +Inf% 1577 ± 67% proc-vmstat.compact_isolated
> 1005 ± 4% -65.8% 344.00 ± 7% proc-vmstat.kswapd_low_wmark_hit_quickly
> 320062 +23.2% 394374 ± 4% proc-vmstat.nr_active_file
> 6411 ± 2% -76.4% 1511 ± 4% proc-vmstat.nr_free_cma
> 277.00 ± 12% -51.4% 134.75 ± 52% proc-vmstat.nr_vmscan_immediate_reclaim
> 320049 +23.2% 394353 ± 4% proc-vmstat.nr_zone_active_file
> 71262212 ± 15% +110.3% 1.499e+08 ± 3% proc-vmstat.numa_foreign
> 5.042e+08 ± 2% -34.3% 3.314e+08 proc-vmstat.numa_hit
> 5.041e+08 ± 2% -34.3% 3.314e+08 proc-vmstat.numa_local
> 71262212 ± 15% +110.3% 1.499e+08 ± 3% proc-vmstat.numa_miss
> 71273176 ± 15% +110.3% 1.499e+08 ± 3% proc-vmstat.numa_other
> 1007 ± 4% -65.6% 346.25 ± 7% proc-vmstat.pageoutrun
> 23070268 -16.0% 19386190 proc-vmstat.pgalloc_dma32
> 5.525e+08 -16.7% 4.603e+08 proc-vmstat.pgalloc_normal
> 5.753e+08 -16.4% 4.812e+08 proc-vmstat.pgfault
> 5.751e+08 -16.3% 4.813e+08 proc-vmstat.pgfree
> 5.748e+08 -16.4% 4.806e+08 proc-vmstat.pgmajfault
> 2.299e+09 -16.4% 1.923e+09 proc-vmstat.pgpgin
> 8.396e+08 -17.8% 6.901e+08 proc-vmstat.pgscan_direct
> 3.018e+08 ± 2% -13.0% 2.627e+08 proc-vmstat.pgscan_kswapd
> 4.1e+08 -15.1% 3.48e+08 proc-vmstat.pgsteal_direct
> 1.542e+08 ± 3% -20.9% 1.22e+08 ± 3% proc-vmstat.pgsteal_kswapd
> 23514 ± 4% -23.1% 18076 ± 16% proc-vmstat.slabs_scanned
> 343040 ± 2% +40.3% 481253 ± 2% proc-vmstat.workingset_activate
> 2.525e+08 -20.1% 2.018e+08 proc-vmstat.workingset_refault
> 13.64 ± 3% -1.7 11.96 ± 2% perf-profile.calltrace.cycles-pp.ext4_mpage_readpages.filemap_fault.ext4_filemap_fault.__do_fault.__handle_mm_fault
> 11.67 ± 3% -1.4 10.29 ± 2% perf-profile.calltrace.cycles-pp.submit_bio.ext4_mpage_readpages.filemap_fault.ext4_filemap_fault.__do_fault
> 11.64 ± 3% -1.4 10.25 ± 2% perf-profile.calltrace.cycles-pp.generic_make_request.submit_bio.ext4_mpage_readpages.filemap_fault.ext4_filemap_fault
> 11.10 ± 3% -1.3 9.82 ± 2% perf-profile.calltrace.cycles-pp.pmem_make_request.generic_make_request.submit_bio.ext4_mpage_readpages.filemap_fault
> 9.21 ± 3% -1.2 8.04 ± 3% perf-profile.calltrace.cycles-pp.pmem_do_bvec.pmem_make_request.generic_make_request.submit_bio.ext4_mpage_readpages
> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.cpu_startup_entry.start_secondary.secondary_startup_64
> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary.secondary_startup_64
> 27.33 ± 4% -1.0 26.35 ± 5% perf-profile.calltrace.cycles-pp.start_secondary.secondary_startup_64
> 26.79 ± 4% -0.8 25.98 ± 5% perf-profile.calltrace.cycles-pp.intel_idle.cpuidle_enter_state.do_idle.cpu_startup_entry.start_secondary
> 27.98 ± 3% -0.8 27.22 ± 4% perf-profile.calltrace.cycles-pp.secondary_startup_64
> 5.36 ± 12% -0.6 4.76 ± 7% perf-profile.calltrace.cycles-pp.kswapd.kthread.ret_from_fork
> 5.36 ± 12% -0.6 4.76 ± 7% perf-profile.calltrace.cycles-pp.shrink_node.kswapd.kthread.ret_from_fork
> 5.30 ± 12% -0.6 4.71 ± 7% perf-profile.calltrace.cycles-pp.shrink_inactive_list.shrink_node_memcg.shrink_node.kswapd.kthread
> 5.35 ± 12% -0.6 4.76 ± 7% perf-profile.calltrace.cycles-pp.shrink_node_memcg.shrink_node.kswapd.kthread.ret_from_fork
> 5.43 ± 12% -0.5 4.88 ± 7% perf-profile.calltrace.cycles-pp.ret_from_fork
> 5.43 ± 12% -0.5 4.88 ± 7% perf-profile.calltrace.cycles-pp.kthread.ret_from_fork
> 11.04 ± 2% -0.2 10.82 ± 2% perf-profile.calltrace.cycles-pp.shrink_page_list.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages
> 62.44 ± 2% +1.9 64.38 perf-profile.calltrace.cycles-pp.page_fault
> 62.38 ± 2% +2.0 64.33 perf-profile.calltrace.cycles-pp.__do_page_fault.do_page_fault.page_fault
> 62.38 ± 2% +2.0 64.34 perf-profile.calltrace.cycles-pp.do_page_fault.page_fault
> 61.52 ± 2% +2.1 63.58 perf-profile.calltrace.cycles-pp.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
> 61.34 ± 2% +2.1 63.44 perf-profile.calltrace.cycles-pp.__handle_mm_fault.handle_mm_fault.__do_page_fault.do_page_fault.page_fault
> 30.18 ± 3% +2.3 32.45 ± 2% perf-profile.calltrace.cycles-pp.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages
> 7.98 ± 3% +2.3 10.33 ± 2% perf-profile.calltrace.cycles-pp.add_to_page_cache_lru.filemap_fault.ext4_filemap_fault.__do_fault.__handle_mm_fault
> 30.48 ± 3% +2.4 32.83 ± 2% perf-profile.calltrace.cycles-pp.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.filemap_fault.ext4_filemap_fault
> 30.46 ± 3% +2.4 32.81 ± 2% perf-profile.calltrace.cycles-pp.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask.filemap_fault
> 30.46 ± 3% +2.4 32.81 ± 2% perf-profile.calltrace.cycles-pp.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath.__alloc_pages_nodemask
> 30.37 ± 3% +2.4 32.75 ± 2% perf-profile.calltrace.cycles-pp.shrink_node_memcg.shrink_node.do_try_to_free_pages.try_to_free_pages.__alloc_pages_slowpath
> 5.58 ± 4% +2.5 8.08 ± 2% perf-profile.calltrace.cycles-pp.__lru_cache_add.add_to_page_cache_lru.filemap_fault.ext4_filemap_fault.__do_fault
> 32.88 ± 3% +2.5 35.38 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages_nodemask.filemap_fault.ext4_filemap_fault.__do_fault.__handle_mm_fault
> 5.51 ± 4% +2.5 8.02 ± 2% perf-profile.calltrace.cycles-pp.pagevec_lru_move_fn.__lru_cache_add.add_to_page_cache_lru.filemap_fault.ext4_filemap_fault
> 4.24 ± 4% +2.5 6.76 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.add_to_page_cache_lru.filemap_fault
> 4.18 ± 4% +2.5 6.70 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.add_to_page_cache_lru
> 18.64 ± 3% +2.5 21.16 ± 2% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock_irq.shrink_inactive_list.shrink_node_memcg.shrink_node
> 31.65 ± 3% +2.7 34.31 ± 2% perf-profile.calltrace.cycles-pp.__alloc_pages_slowpath.__alloc_pages_nodemask.filemap_fault.ext4_filemap_fault.__do_fault
> 17.21 ± 3% +2.7 19.93 ± 2% perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.shrink_inactive_list.shrink_node_memcg.shrink_node.do_try_to_free_pages

It looks like there is more lru lock contention. It would be caused by
using a movable zone for the CMA memory by this patch. In this case,
reclaim for normal memory skips the lru page on the movable zone so needs
more time to find enough reclaim target pages. It would increase lru lock
holding time and then cause contention.

Could you give me another stat 'pgskip_XXX' in /proc/vmstat to confirm
my theory?

BTW, other stats for memory management looks better than before,
less nr_free_cma, less pgmajfault, less reclaim scan/steal.

Thanks.