Re: [PATCH 0/3] Removal of lumpy reclaim V2
From: Ying Han
Date: Wed Apr 11 2012 - 19:37:04 EST
On Wed, Apr 11, 2012 at 9:38 AM, Mel Gorman <mgorman@xxxxxxx> wrote:
> Andrew, these three patches should replace the two lumpy reclaim patches
> you already have. When applied, there is no functional difference (slight
> changes in layout) but the changelogs are better.
>
> Changelog since V1
> o Ying pointed out that compaction was waiting on page writeback and the
> description of the patches in V1 was broken. This version is the same
> except that it is structured differently to explain that waiting on
> page writeback is removed.
> o Rebased to v3.4-rc2
>
> This series removes lumpy reclaim and some stalling logic that was
> unintentionally being used by memory compaction. The end result
> is that stalling on dirty pages during page reclaim now depends on
> wait_iff_congested().
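> 
> For reference, the throttling that remains is along these lines (a
> simplified sketch rather than a quote of the exact hunk): if a scan of the
> inactive list isolates a large proportion of pages still under writeback,
> reclaim backs off briefly, and only if the backing device is actually
> congested:
> 
> 	/*
> 	 * Simplified sketch: throttle on congestion instead of waiting
> 	 * for writeback to complete on individual pages. nr_writeback
> 	 * and nr_taken are gathered while shrinking the inactive list.
> 	 */
> 	if (nr_writeback && nr_writeback >=
> 			(nr_taken >> (DEF_PRIORITY - priority)))
> 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);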
>
> Four kernels were compared
>
> 3.3.0 vanilla
> 3.4.0-rc2 vanilla
> 3.4.0-rc2 lumpyremove-v2 is patch one from this series
> 3.4.0-rc2 nosync-v2r3 is the full series
>
> Removing lumpy reclaim saves almost 900 bytes of text whereas the full
> series removes just over 1200 bytes.
>
> text data bss dec hex filename
> 6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla
> 6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2
> 6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2
>
> There are behaviour changes in the series and so tests were run with
> monitoring of ftrace events. This disrupts the results so the performance
> figures are distorted, but the new behaviour should be clearer.
>
> fs-mark running in a threaded configuration showed little of interest as
> it did not push reclaim aggressively.
>
> FS-Mark Multi Threaded
> 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
> Files/s min 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%)
> Files/s mean 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%)
> Files/s stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
> Files/s max 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%)
> Overhead min 508667.00 ( 0.00%) 521350.00 (-2.49%) 544292.00 (-7.00%) 547168.00 (-7.57%)
> Overhead mean 551185.00 ( 0.00%) 652690.73 (-18.42%) 991208.40 (-79.83%) 570130.53 (-3.44%)
> Overhead stddev 18200.69 ( 0.00%) 331958.29 (-1723.88%) 1579579.43 (-8578.68%) 9576.81 (47.38%)
> Overhead max 576775.00 ( 0.00%) 1846634.00 (-220.17%) 6901055.00 (-1096.49%) 585675.00 (-1.54%)
> MMTests Statistics: duration
> Sys Time Running Test (seconds) 309.90 300.95 307.33 298.95
> User+Sys Time Running Test (seconds) 319.32 309.67 315.69 307.51
> Total Elapsed Time (seconds) 1187.85 1193.09 1191.98 1193.73
>
> MMTests Statistics: vmstat
> Page Ins 80532 82212 81420 79480
> Page Outs 111434984 111456240 111437376 111582628
> Swap Ins 0 0 0 0
> Swap Outs 0 0 0 0
> Direct pages scanned 44881 27889 27453 34843
> Kswapd pages scanned 25841428 25860774 25861233 25843212
> Kswapd pages reclaimed 25841393 25860741 25861199 25843179
> Direct pages reclaimed 44881 27889 27453 34843
> Kswapd efficiency 99% 99% 99% 99%
> Kswapd velocity 21754.791 21675.460 21696.029 21649.127
> Direct efficiency 100% 100% 100% 100%
> Direct velocity 37.783 23.375 23.031 29.188
> Percentage direct scans 0% 0% 0% 0%
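> 
> (For clarity on the derived figures: "efficiency" is pages reclaimed as a
> percentage of pages scanned and "velocity" is pages scanned per second of
> elapsed time, e.g. kswapd velocity for 3.3.0-vanilla is
> 25841428 / 1187.85 ~= 21754.8 pages/second.)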
>
> ftrace showed that there was no stalling on writeback or pages submitted
> for IO from reclaim context.
>
>
> postmark was similar and while it was more interesting, it also did not
> push reclaim heavily.
>
> POSTMARK
> 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
> Transactions per second: 16.00 ( 0.00%) 20.00 (25.00%) 18.00 (12.50%) 17.00 ( 6.25%)
> Data megabytes read per second: 18.80 ( 0.00%) 24.27 (29.10%) 22.26 (18.40%) 20.54 ( 9.26%)
> Data megabytes written per second: 35.83 ( 0.00%) 46.25 (29.08%) 42.42 (18.39%) 39.14 ( 9.24%)
> Files created alone per second: 28.00 ( 0.00%) 38.00 (35.71%) 34.00 (21.43%) 30.00 ( 7.14%)
> Files create/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%)
> Files deleted alone per second: 556.00 ( 0.00%) 1224.00 (120.14%) 3062.00 (450.72%) 6124.00 (1001.44%)
> Files delete/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%)
>
> MMTests Statistics: duration
> Sys Time Running Test (seconds) 113.34 107.99 109.73 108.72
> User+Sys Time Running Test (seconds) 145.51 139.81 143.32 143.55
> Total Elapsed Time (seconds) 1159.16 899.23 980.17 1062.27
>
> MMTests Statistics: vmstat
> Page Ins 13710192 13729032 13727944 13760136
> Page Outs 43071140 42987228 42733684 42931624
> Swap Ins 0 0 0 0
> Swap Outs 0 0 0 0
> Direct pages scanned 0 0 0 0
> Kswapd pages scanned 9941613 9937443 9939085 9929154
> Kswapd pages reclaimed 9940926 9936751 9938397 9928465
> Direct pages reclaimed 0 0 0 0
> Kswapd efficiency 99% 99% 99% 99%
> Kswapd velocity 8576.567 11051.058 10140.164 9347.109
> Direct efficiency 100% 100% 100% 100%
> Direct velocity 0.000 0.000 0.000 0.000
>
> It looks here like the full series regresses performance, but as ftrace
> showed no usage of wait_iff_congested() or sync reclaim, I am assuming it is
> a disruption due to monitoring. Other data such as memory usage, page IO and
> swap IO all looked similar.
>
> Running a benchmark with a plain DD showed nothing very interesting. The
> full series stalled in wait_iff_congested() slightly less but stall times
> on vanilla kernels were marginal.
>
> Running a benchmark that hammered on file-backed mappings showed stalls
> due to congestion but none due to sync writeback.
>
> MICRO
> 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
> MMTests Statistics: duration
> Sys Time Running Test (seconds) 308.13 294.50 298.75 299.53
> User+Sys Time Running Test (seconds) 330.45 316.28 318.93 320.79
> Total Elapsed Time (seconds) 1814.90 1833.88 1821.14 1832.91
>
> MMTests Statistics: vmstat
> Page Ins 108712 120708 97224 110344
> Page Outs 155514576 156017404 155813676 156193256
> Swap Ins 0 0 0 0
> Swap Outs 0 0 0 0
> Direct pages scanned 2599253 1550480 2512822 2414760
> Kswapd pages scanned 69742364 71150694 68839041 69692533
> Kswapd pages reclaimed 34824488 34773341 34796602 34799396
> Direct pages reclaimed 53693 94750 61792 75205
> Kswapd efficiency 49% 48% 50% 49%
> Kswapd velocity 38427.662 38797.901 37799.972 38022.889
> Direct efficiency 2% 6% 2% 3%
> Direct velocity 1432.174 845.464 1379.807 1317.446
> Percentage direct scans 3% 2% 3% 3%
> Page writes by reclaim 0 0 0 0
> Page writes file 0 0 0 0
> Page writes anon 0 0 0 0
> Page reclaim immediate 0 0 0 1218
> Page rescued immediate 0 0 0 0
> Slabs scanned 15360 16384 13312 16384
> Direct inode steals 0 0 0 0
> Kswapd inode steals 4340 4327 1630 4323
>
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited 0 0 0 0
> Direct time congest waited 0ms 0ms 0ms 0ms
> Direct full congest waited 0 0 0 0
> Direct number conditional waited 900 870 754 789
> Direct time conditional waited 0ms 0ms 0ms 20ms
> Direct full conditional waited 0 0 0 0
> KSwapd number congest waited 2106 2308 2116 1915
> KSwapd time congest waited 139924ms 157832ms 125652ms 132516ms
> KSwapd full congest waited 1346 1530 1202 1278
> KSwapd number conditional waited 12922 16320 10943 14670
> KSwapd time conditional waited 0ms 0ms 0ms 0ms
> KSwapd full conditional waited 0 0 0 0
>
>
> Reclaim statistics are not radically changed. The stall times in kswapd
> are massive but they are clearly due to calls to congestion_wait(), almost
> certainly the call in balance_pgdat(). Otherwise, stalls due to dirty pages
> are non-existent.
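> 
> For reference, the balance_pgdat() call in question is the short nap kswapd
> takes when it has scanned a lot without restoring the watermarks. Roughly
> (a simplified sketch, not a quote of the current code):
> 
> 	/* kswapd is in trouble: nap briefly before another pass */
> 	if (total_scanned && (priority < DEF_PRIORITY - 2))
> 		congestion_wait(BLK_RW_ASYNC, HZ/10);
> 
> Sleeps capped at 100ms like this accumulate quickly when kswapd keeps
> looping at low priority, which is consistent with the large "congest
> waited" figures above.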
>
> I ran a benchmark that stressed high-order allocation. This is a very
> artificial load but it was used in the past to evaluate lumpy reclaim and
> compaction. Generally I look at allocation success rates and latency figures.
>
> STRESS-HIGHALLOC
> 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3
> Pass 1 81.00 ( 0.00%) 28.00 (-53.00%) 24.00 (-57.00%) 28.00 (-53.00%)
> Pass 2 82.00 ( 0.00%) 39.00 (-43.00%) 38.00 (-44.00%) 43.00 (-39.00%)
> while Rested 88.00 ( 0.00%) 87.00 (-1.00%) 88.00 ( 0.00%) 88.00 ( 0.00%)
>
> MMTests Statistics: duration
> Sys Time Running Test (seconds) 740.93 681.42 685.14 684.87
> User+Sys Time Running Test (seconds) 2922.65 3269.52 3281.35 3279.44
> Total Elapsed Time (seconds) 1161.73 1152.49 1159.55 1161.44
>
> MMTests Statistics: vmstat
> Page Ins 4486020 2807256 2855944 2876244
> Page Outs 7261600 7973688 7975320 7986120
> Swap Ins 31694 0 0 0
> Swap Outs 98179 0 0 0
> Direct pages scanned 53494 57731 34406 113015
> Kswapd pages scanned 6271173 1287481 1278174 1219095
> Kswapd pages reclaimed 2029240 1281025 1260708 1201583
> Direct pages reclaimed 1468 14564 16649 92456
> Kswapd efficiency 32% 99% 98% 98%
> Kswapd velocity 5398.133 1117.130 1102.302 1049.641
> Direct efficiency 2% 25% 48% 81%
> Direct velocity 46.047 50.092 29.672 97.306
> Percentage direct scans 0% 4% 2% 8%
> Page writes by reclaim 1616049 0 0 0
> Page writes file 1517870 0 0 0
> Page writes anon 98179 0 0 0
> Page reclaim immediate 103778 27339 9796 17831
> Page rescued immediate 0 0 0 0
> Slabs scanned 1096704 986112 980992 998400
> Direct inode steals 223 215040 216736 247881
> Kswapd inode steals 175331 61548 68444 63066
> Kswapd skipped wait 21991 0 1 0
> THP fault alloc 1 135 125 134
> THP collapse alloc 393 311 228 236
> THP splits 25 13 7 8
> THP fault fallback 0 0 0 0
> THP collapse fail 3 5 7 7
> Compaction stalls 865 1270 1422 1518
> Compaction success 370 401 353 383
> Compaction failures 495 869 1069 1135
> Compaction pages moved 870155 3828868 4036106 4423626
> Compaction move failure 26429 23865 29742 27514
>
> Success rates are completely hosed for 3.4-rc2 which is almost certainly
> due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. I
> expected this would happen for kswapd and impair allocation success rates
> (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much of
> a difference: 80% less scanning, 37% less reclaim by kswapd.
>
> In comparison, reclaim/compaction is not aggressive and gives up easily,
> which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be
> much more aggressive about reclaim/compaction than THP allocations are. The
> stress test above allocates like neither THP nor hugetlbfs but is much
> closer to THP.
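> 
> For anyone mapping this onto their own workload, the distinction is the gfp
> mask. Growing the hugetlbfs pool allocates with something like the
> following (treat the exact flag combination as an approximation):
> 
> 	/*
> 	 * hugetlb-style allocation: __GFP_REPEAT tells the allocator to
> 	 * keep retrying reclaim/compaction instead of giving up after
> 	 * one cycle.
> 	 */
> 	page = alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
> 			   __GFP_REPEAT | __GFP_NOWARN,
> 			   HUGETLB_PAGE_ORDER);
> 
> THP allocations do not pass __GFP_REPEAT and so back off after a single
> reclaim/compaction cycle.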
>
> Mainline is now impaired in terms of high order allocation under heavy load
> although I do not know to what degree as I did not test with __GFP_REPEAT.
> Keep this in mind for bugs related to hugepage pool resizing, THP allocation
> and high order atomic allocation failures from network devices.
>
> In terms of congestion throttling, I see the following for this test
>
> FTrace Reclaim Statistics: congestion_wait
> Direct number congest waited 3 0 0 0
> Direct time congest waited 0ms 0ms 0ms 0ms
> Direct full congest waited 0 0 0 0
> Direct number conditional waited 957 512 1081 1075
> Direct time conditional waited 0ms 0ms 0ms 0ms
> Direct full conditional waited 0 0 0 0
> KSwapd number congest waited 36 4 3 5
> KSwapd time congest waited 3148ms 400ms 300ms 500ms
> KSwapd full congest waited 30 4 3 5
> KSwapd number conditional waited 88514 197 332 542
> KSwapd time conditional waited 4980ms 0ms 0ms 0ms
> KSwapd full conditional waited 49 0 0 0
>
> The "conditional waited" times are the most interesting as this is directly
> impacted by the number of dirty pages encountered during scan. As lumpy
> reclaim is no longer scanning contiguous ranges, it is finding fewer dirty
> pages. This brings wait times from about 5 seconds to 0. kswapd itself is
> still calling congestion_wait() so it'll still stall but it's a lot less.
>
> In terms of the type of IO we were doing, I see this
>
> FTrace Reclaim Statistics: mm_vmscan_writepage
> Direct writes anon sync 0 0 0 0
> Direct writes anon async 0 0 0 0
> Direct writes file sync 0 0 0 0
> Direct writes file async 0 0 0 0
> Direct writes mixed sync 0 0 0 0
> Direct writes mixed async 0 0 0 0
> KSwapd writes anon sync 0 0 0 0
> KSwapd writes anon async 91682 0 0 0
> KSwapd writes file sync 0 0 0 0
> KSwapd writes file async 822629 0 0 0
> KSwapd writes mixed sync 0 0 0 0
> KSwapd writes mixed async 0 0 0 0
>
> In 3.3.0-vanilla, kswapd was doing a bunch of async writes of pages but
> reclaim/compaction was never reaching a point where it was doing sync
> IO. This does not guarantee that reclaim/compaction was not calling
> wait_on_page_writeback() but I would consider it unlikely. It indicates
> that merging patches 2 and 3 to stop reclaim/compaction calling
> wait_on_page_writeback() should be safe.
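> 
> To make the last point concrete, the behaviour patches 2 and 3 remove is
> the sync-reclaim case in shrink_page_list() that looked roughly like this
> (a simplified sketch of the old code; sync_reclaim stands in for the old
> reclaim_mode checks):
> 
> 	if (PageWriteback(page)) {
> 		/* Old behaviour: sync reclaim blocked here until the
> 		 * in-flight IO on the page completed */
> 		if (sync_reclaim && may_enter_fs)
> 			wait_on_page_writeback(page);
> 		else
> 			goto keep_locked;	/* skip the page this pass */
> 	}
> 
> As the mm_vmscan_writepage data shows reclaim never reached the sync path
> in these tests, nothing should have been depending on that wait.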
>
> include/trace/events/vmscan.h | 40 ++-----
> mm/vmscan.c | 263 ++++-------------------------------------
> 2 files changed, 37 insertions(+), 266 deletions(-)
>
> --
> 1.7.9.2
>
It might be a naive question, but what do we do with users who have the
following in their .config file?
# CONFIG_COMPACTION is not set
--Ying