Re: [LKP] [lkp] [xfs] 68a9f5e700: aim7.jobs-per-min -13.6% regression

From: Huang, Ying
Date: Thu Aug 11 2016 - 13:51:54 EST


Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> writes:

> On Thu, Aug 11, 2016 at 8:57 AM, Christoph Hellwig <hch@xxxxxx> wrote:
>>
>> The one liner below (not tested yet) to simply remove it should fix that
>> up. I also noticed we have a spurious pagefault_disable/enable, I
>> need to dig into the history of that first, though.
>
> Hopefully the pagefault_disable/enable doesn't matter for this case.
>
> Can we get this one-liner tested with the kernel robot for comparison?
> I really think a messed-up LRU list could cause bad IO patterns, and
> end up keeping dirty pages around that should be streaming out to disk
> and re-used, so causing memory pressure etc for no good reason.
>
> I think the mapping->tree_lock issue that Dave sees is interesting
> too, but the kswapd activity (and the extra locking it causes) could
> also be a symptom of the same thing - memory pressure due to just
> putting pages in the active file that simply shouldn't be there.
>
> So the trivial oneliner _might_ just explain things. It would be
> really nice if the regression turns out to be due to something so
> easily fixed.
>
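
(For reference, the activation Linus refers to comes from mark_page_accessed().
A simplified sketch of what it does, based on mm/swap.c with the compound-page
and idle-page tracking details left out:

	/* simplified from mm/swap.c:mark_page_accessed() */
	void mark_page_accessed(struct page *page)
	{
		if (!PageActive(page) && !PageUnevictable(page) &&
				PageReferenced(page)) {
			/*
			 * Second access: promote the page to the active
			 * LRU list (directly or via a per-CPU pagevec).
			 */
			if (PageLRU(page))
				activate_page(page);
			else
				__lru_cache_activate_page(page);
			ClearPageReferenced(page);
		} else if (!PageReferenced(page)) {
			/* First access: only set the referenced bit. */
			SetPageReferenced(page);
		}
	}

So calling it on every iteration of the buffered write loop promotes pages
that are hit more than once to the active file LRU, which is consistent with
the pgactivate and nr_active_file changes in the table below.)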

Here is the test result for the debug patch. It appears that the aim7
score is a little better, but the regression is not fully recovered.

commit 5c70fdfdf82723e47ac180e36a7638ca06ea19d8
Author: Christoph Hellwig <hch@xxxxxx>
Date: Thu Aug 11 17:57:21 2016 +0200

dbg fix 68a9f5e700: aim7.jobs-per-min -13.6% regression

On Wed, Aug 10, 2016 at 06:00:24PM -0700, Linus Torvalds wrote:
> The biggest difference is that we have "mark_page_accessed()" show up
> after, and not before. There was also a lot of LRU noise in the
> non-profile data. I wonder if that is the reason here: the old model
> of using generic_perform_write/block_page_mkwrite didn't mark the
> pages accessed, and now with iomap_file_buffered_write() they get
> marked as active and that screws up the LRU list, and makes us not
> flush out the dirty pages well (because they are seen as active and
> not good for writeback), and then you get bad memory use.

And that's actually a "bug" in the new code - mostly because I failed
to pick up changes to the core code happening after we 'forked' it,
in this case commit 2457ae ("mm: non-atomically mark page accessed during page
cache allocation where possible").

The one liner below (not tested yet) to simply remove it should fix that
up. I also noticed we have a spurious pagefault_disable/enable, I
need to dig into the history of that first, though.

diff --git a/fs/iomap.c b/fs/iomap.c
index 48141b8..f39c318 100644
--- a/fs/iomap.c
+++ b/fs/iomap.c
@@ -199,7 +199,6 @@ again:
 		pagefault_enable();
 
 		flush_dcache_page(page);
-		mark_page_accessed(page);
 
 		status = iomap_write_end(inode, pos, bytes, copied, page);
 		if (unlikely(status < 0))

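(For comparison, the piece of the core buffered write loop that this code was
forked from, mm/filemap.c:generic_perform_write(), now looks roughly like the
following; abbreviated, with error handling and the iov_iter prefault left out:

	status = a_ops->write_begin(file, mapping, pos, bytes, flags,
					&page, &fsdata);
	if (unlikely(status < 0))
		break;

	if (mapping_writably_mapped(mapping))
		flush_dcache_page(page);

	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
	flush_dcache_page(page);

	/* no pagefault_disable()/enable() pair, no mark_page_accessed() */
	status = a_ops->write_end(file, mapping, pos, bytes, copied,
					page, fsdata);

This also shows the pagefault_disable()/enable() pair Christoph calls
spurious: the core loop no longer has it around
iov_iter_copy_from_user_atomic().)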

=========================================================================================
compiler/cpufreq_governor/disk/fs/kconfig/load/rootfs/tbox_group/test/testcase:
gcc-6/performance/1BRD_48G/xfs/x86_64-rhel/3000/debian-x86_64-2015-02-07.cgz/ivb44/disk_wrt/aim7

commit:
f0c6bcba74ac51cb77aadb33ad35cb2dc1ad1506
68a9f5e7007c1afa2cf6830b690a90d0187c0684
5c70fdfdf82723e47ac180e36a7638ca06ea19d8

f0c6bcba74ac51cb 68a9f5e7007c1afa2cf6830b69 5c70fdfdf82723e47ac180e36a
---------------- -------------------------- --------------------------
         %stddev     %change         %stddev     %change         %stddev
             \          |                \          |                \
486586 ± 0% -13.6% 420523 ± 0% -11.2% 432165 ± 0% aim7.jobs-per-min
37.23 ± 0% +15.5% 43.02 ± 0% +12.4% 41.87 ± 0% aim7.time.elapsed_time
37.23 ± 0% +15.5% 43.02 ± 0% +12.4% 41.87 ± 0% aim7.time.elapsed_time.max
6424 ± 1% +31.3% 8437 ± 1% +24.4% 7992 ± 3% aim7.time.involuntary_context_switches
2489 ± 3% -2.5% 2425 ± 1% -1.0% 2465 ± 3% aim7.time.maximum_resident_set_size
151288 ± 0% +3.0% 155764 ± 0% +2.0% 154252 ± 1% aim7.time.minor_page_faults
376.31 ± 0% +28.4% 483.23 ± 0% +22.5% 460.96 ± 1% aim7.time.system_time
429058 ± 0% -20.1% 343013 ± 0% -16.7% 357231 ± 0% aim7.time.voluntary_context_switches
16014 ± 0% +28.3% 20545 ± 1% -4.0% 15369 ± 0% meminfo.Active(file)
127154 ± 9% -15.5% 107424 ± 10% -4.4% 121505 ± 11% softirqs.SCHED
24561 ± 0% -27.1% 17895 ± 1% -22.5% 19033 ± 1% vmstat.system.cs
47289 ± 0% +1.4% 47938 ± 0% +1.1% 47807 ± 0% vmstat.system.in
4003 ± 0% +27.8% 5117 ± 1% -4.0% 3842 ± 0% proc-vmstat.nr_active_file
979.25 ± 0% +59.1% 1558 ± 5% -11.8% 864.00 ± 1% proc-vmstat.pgactivate
4699 ± 3% +132.6% 10932 ± 78% +77.1% 8321 ± 36% proc-vmstat.pgpgout
7868 ± 1% +28.9% 10140 ± 6% -1.8% 7724 ± 1% numa-meminfo.node0.Active(file)
161402 ± 1% -21.0% 127504 ± 25% -1.7% 158708 ± 5% numa-meminfo.node0.Slab
8148 ± 1% +27.8% 10416 ± 7% -6.1% 7648 ± 1% numa-meminfo.node1.Active(file)
81041 ± 3% +43.8% 116540 ± 27% +5.7% 85631 ± 8% numa-meminfo.node1.Slab
13903 ± 18% +22.8% 17076 ± 9% -27.8% 10039 ± 12% numa-numastat.node0.numa_foreign
12525 ± 20% +27.1% 15922 ± 9% -26.8% 9170 ± 17% numa-numastat.node0.numa_miss
14084 ± 18% -35.4% 9102 ± 17% +16.5% 16401 ± 8% numa-numastat.node1.numa_foreign
15461 ± 17% -33.7% 10257 ± 14% +11.7% 17270 ± 9% numa-numastat.node1.numa_miss
1966 ± 1% +30.4% 2565 ± 4% -1.8% 1930 ± 1% numa-vmstat.node0.nr_active_file
4204 ± 3% +17.7% 4947 ± 7% +6.1% 4461 ± 6% numa-vmstat.node0.nr_alloc_batch
521.75 ± 4% +5.4% 549.80 ± 4% -12.0% 459.00 ± 12% numa-vmstat.node0.nr_dirty
2037 ± 1% +24.9% 2543 ± 5% -6.1% 1912 ± 1% numa-vmstat.node1.nr_active_file
24.26 ± 0% +8.7% 26.36 ± 0% +7.1% 25.99 ± 0% turbostat.%Busy
686.75 ± 0% +9.2% 749.80 ± 0% +6.5% 731.67 ± 0% turbostat.Avg_MHz
0.29 ± 1% -24.9% 0.22 ± 2% -18.8% 0.23 ± 2% turbostat.CPU%c3
91.39 ± 2% +3.5% 94.60 ± 0% +4.4% 95.45 ± 1% turbostat.CorWatt
121.88 ± 1% +2.6% 124.99 ± 0% +3.9% 126.57 ± 1% turbostat.PkgWatt
37.23 ± 0% +15.5% 43.02 ± 0% +12.4% 41.87 ± 0% time.elapsed_time
37.23 ± 0% +15.5% 43.02 ± 0% +12.4% 41.87 ± 0% time.elapsed_time.max
6424 ± 1% +31.3% 8437 ± 1% +24.4% 7992 ± 3% time.involuntary_context_switches
1038 ± 0% +10.5% 1148 ± 0% +8.4% 1126 ± 0% time.percent_of_cpu_this_job_got
376.31 ± 0% +28.4% 483.23 ± 0% +22.5% 460.96 ± 1% time.system_time
429058 ± 0% -20.1% 343013 ± 0% -16.7% 357231 ± 0% time.voluntary_context_switches
53643508 ± 0% -19.5% 43181295 ± 2% -16.4% 44822518 ± 0% cpuidle.C1-IVT.time
318952 ± 0% -25.7% 236993 ± 0% -21.3% 251084 ± 0% cpuidle.C1-IVT.usage
3471235 ± 2% -18.1% 2843694 ± 3% -12.9% 3022770 ± 1% cpuidle.C1E-IVT.time
46642 ± 1% -22.5% 36144 ± 0% -17.4% 38545 ± 0% cpuidle.C1E-IVT.usage
12601665 ± 1% -21.9% 9837996 ± 1% -18.9% 10218444 ± 0% cpuidle.C3-IVT.time
79872 ± 1% -19.7% 64163 ± 0% -16.8% 66477 ± 0% cpuidle.C3-IVT.usage
1.292e+09 ± 0% +13.7% 1.469e+09 ± 0% +10.9% 1.434e+09 ± 0% cpuidle.C6-IVT.time
5131 ±121% -98.5% 75.60 ±200% -93.3% 344.33 ± 98% latency_stats.avg.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
2966 ± 80% -27.5% 2151 ± 16% +359.2% 13624 ± 77% latency_stats.max.down.xfs_buf_lock._xfs_buf_find.xfs_buf_get_map.xfs_buf_read_map.xfs_trans_read_buf_map.xfs_read_agi.xfs_iunlink.xfs_droplink.xfs_remove.xfs_vn_unlink.vfs_unlink
5131 ±121% -98.5% 75.60 ±200% -93.3% 344.33 ± 98% latency_stats.max.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
9739 ± 99% -99.0% 95.40 ± 9% -99.1% 91.33 ± 5% latency_stats.max.submit_bio_wait.blkdev_issue_flush.ext4_sync_fs.sync_fs_one_sb.iterate_supers.sys_sync.entry_SYSCALL_64_fastpath
7739 ± 81% -71.5% 2202 ± 46% -77.4% 1752 ± 23% latency_stats.max.wait_on_page_bit.__filemap_fdatawait_range.filemap_fdatawait_keep_errors.sync_inodes_sb.sync_inodes_one_sb.iterate_supers.sys_sync.entry_SYSCALL_64_fastpath
5131 ±121% -98.5% 75.60 ±200% -93.3% 344.33 ± 98% latency_stats.sum.nfs_wait_on_request.nfs_updatepage.nfs_write_end.generic_perform_write.__generic_file_write_iter.generic_file_write_iter.nfs_file_write.__vfs_write.vfs_write.SyS_write.entry_SYSCALL_64_fastpath
10459 ± 97% -97.5% 263.80 ± 5% -97.6% 254.67 ± 1% latency_stats.sum.submit_bio_wait.blkdev_issue_flush.ext4_sync_fs.sync_fs_one_sb.iterate_supers.sys_sync.entry_SYSCALL_64_fastpath
9097 ± 81% -70.2% 2708 ± 40% -74.3% 2335 ± 21% latency_stats.sum.wait_on_page_bit.__filemap_fdatawait_range.filemap_fdatawait_keep_errors.sync_inodes_sb.sync_inodes_one_sb.iterate_supers.sys_sync.entry_SYSCALL_64_fastpath
229.50 ± 11% -6.7% 214.20 ± 46% -77.8% 51.00 ± 0% slabinfo.dio.active_objs
229.50 ± 11% -6.7% 214.20 ± 46% -77.8% 51.00 ± 0% slabinfo.dio.num_objs
369.75 ± 17% -6.2% 346.80 ± 11% -21.8% 289.00 ± 16% slabinfo.kmem_cache.active_objs
369.75 ± 17% -6.2% 346.80 ± 11% -21.8% 289.00 ± 16% slabinfo.kmem_cache.num_objs
1022 ± 5% -2.9% 992.80 ± 11% -11.9% 900.33 ± 3% slabinfo.nsproxy.active_objs
1022 ± 5% -2.9% 992.80 ± 11% -11.9% 900.33 ± 3% slabinfo.nsproxy.num_objs
1836 ± 10% +60.6% 2949 ± 9% +45.1% 2665 ± 6% slabinfo.scsi_data_buffer.active_objs
1836 ± 10% +60.6% 2949 ± 9% +45.1% 2665 ± 6% slabinfo.scsi_data_buffer.num_objs
431.75 ± 10% +60.7% 693.80 ± 9% +45.1% 626.67 ± 6% slabinfo.xfs_efd_item.active_objs
431.75 ± 10% +60.7% 693.80 ± 9% +45.1% 626.67 ± 6% slabinfo.xfs_efd_item.num_objs
2.59e+11 ± 6% +22.4% 3.169e+11 ± 5% +22.9% 3.182e+11 ± 7% perf-stat.branch-instructions
0.41 ± 2% -9.2% 0.38 ± 1% -25.5% 0.31 ± 3% perf-stat.branch-miss-rate
1.072e+09 ± 4% +11.3% 1.193e+09 ± 3% -8.5% 9.812e+08 ± 4% perf-stat.branch-misses
972882 ± 0% -17.4% 803235 ± 0% -14.3% 833753 ± 0% perf-stat.context-switches
1.472e+12 ± 6% +20.5% 1.774e+12 ± 5% +16.7% 1.717e+12 ± 7% perf-stat.cpu-cycles
100350 ± 1% -5.2% 95091 ± 1% -3.8% 96553 ± 0% perf-stat.cpu-migrations
7.315e+08 ± 24% +54.9% 1.133e+09 ± 35% +107.1% 1.515e+09 ± 64% perf-stat.dTLB-load-misses
3.225e+11 ± 5% +38.1% 4.454e+11 ± 3% +32.2% 4.263e+11 ± 6% perf-stat.dTLB-loads
2.176e+11 ± 9% +45.2% 3.16e+11 ± 5% +50.1% 3.267e+11 ± 7% perf-stat.dTLB-stores
1.452e+12 ± 6% +27.7% 1.853e+12 ± 5% +28.0% 1.858e+12 ± 7% perf-stat.instructions
42168 ± 16% +24.9% 52676 ± 7% +33.4% 56234 ± 6% perf-stat.instructions-per-iTLB-miss
0.99 ± 0% +5.9% 1.05 ± 0% +9.6% 1.08 ± 0% perf-stat.ipc
252401 ± 0% +6.7% 269219 ± 0% +5.2% 265524 ± 0% perf-stat.minor-faults
10.16 ± 3% +14.5% 11.63 ± 4% +12.3% 11.41 ± 3% perf-stat.node-store-miss-rate
24842185 ± 2% +11.0% 27573897 ± 2% +7.6% 26720271 ± 3% perf-stat.node-store-misses
2.198e+08 ± 2% -4.5% 2.1e+08 ± 5% -5.6% 2.076e+08 ± 2% perf-stat.node-stores
252321 ± 0% +6.6% 269092 ± 0% +5.2% 265415 ± 0% perf-stat.page-faults
4.08 ± 64% +63.1% 6.65 ± 77% +609.5% 28.93 ± 92% sched_debug.cfs_rq:/.MIN_vruntime.avg
157.17 ± 61% +30.0% 204.25 ± 47% +686.9% 1236 ±112% sched_debug.cfs_rq:/.MIN_vruntime.max
4.08 ± 64% +63.1% 6.65 ± 77% +609.5% 28.93 ± 92% sched_debug.cfs_rq:/.max_vruntime.avg
157.17 ± 61% +30.0% 204.25 ± 47% +686.9% 1236 ±112% sched_debug.cfs_rq:/.max_vruntime.max
191.00 ± 35% -23.4% 146.40 ± 15% -28.4% 136.67 ± 5% sched_debug.cfs_rq:/.runnable_load_avg.max
44.43 ± 16% -20.2% 35.47 ± 9% -25.2% 33.25 ± 1% sched_debug.cfs_rq:/.runnable_load_avg.stddev
50.23 ± 19% -27.1% 36.61 ± 15% -30.3% 35.03 ± 15% sched_debug.cpu.cpu_load[1].avg
466.50 ± 29% -55.6% 207.20 ± 73% -47.5% 245.00 ± 67% sched_debug.cpu.cpu_load[1].max
77.78 ± 33% -52.4% 37.05 ± 53% -48.8% 39.86 ± 51% sched_debug.cpu.cpu_load[1].stddev
38.82 ± 19% -24.2% 29.42 ± 15% -29.6% 27.31 ± 10% sched_debug.cpu.cpu_load[2].avg
300.50 ± 33% -55.1% 135.00 ± 46% -50.2% 149.67 ± 46% sched_debug.cpu.cpu_load[2].max
51.71 ± 41% -49.3% 26.20 ± 46% -50.9% 25.39 ± 33% sched_debug.cpu.cpu_load[2].stddev
27.08 ± 19% -22.4% 21.02 ± 15% -27.8% 19.56 ± 11% sched_debug.cpu.cpu_load[3].avg
185.00 ± 43% -49.6% 93.20 ± 44% -48.3% 95.67 ± 30% sched_debug.cpu.cpu_load[3].max
32.91 ± 50% -46.1% 17.75 ± 45% -49.8% 16.51 ± 25% sched_debug.cpu.cpu_load[3].stddev
17.78 ± 20% -20.9% 14.06 ± 13% -25.3% 13.28 ± 13% sched_debug.cpu.cpu_load[4].avg
20.29 ± 55% -44.5% 11.26 ± 39% -47.7% 10.61 ± 27% sched_debug.cpu.cpu_load[4].stddev
4929 ± 18% -24.8% 3704 ± 23% -4.5% 4708 ± 21% sched_debug.cpu.nr_load_updates.avg
276.50 ± 10% -4.4% 264.20 ± 7% -14.3% 237.00 ± 19% sched_debug.cpu.nr_switches.min

Best Regards,
Huang, Ying