[RFC PATCH 0/2] Removal of lumpy reclaim

From: Mel Gorman
Date: Wed Mar 28 2012 - 12:06:24 EST


(cc'ing active people in the thread "[patch 68/92] mm: forbid lumpy-reclaim
in shrink_active_list()")

In the interest of keeping my fingers from the flames at LSF/MM, I'm
releasing an RFC for lumpy reclaim removal. The first patch removes
lumpy reclaim itself and the second removes reclaim_mode_t. They can be
merged together but the resulting patch is harder to review.
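
For reference, this is roughly the reclaim_mode_t machinery that the
second patch removes from mm/vmscan.c (abbreviated from memory; see the
patch itself for the exact lines):

  typedef unsigned __bitwise__ reclaim_mode_t;
  #define RECLAIM_MODE_SINGLE          ((__force reclaim_mode_t)0x01u)
  #define RECLAIM_MODE_ASYNC           ((__force reclaim_mode_t)0x02u)
  #define RECLAIM_MODE_SYNC            ((__force reclaim_mode_t)0x04u)
  #define RECLAIM_MODE_LUMPYRECLAIM    ((__force reclaim_mode_t)0x08u)
  #define RECLAIM_MODE_COMPACTION      ((__force reclaim_mode_t)0x10u)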

The patches are based on commit e22057c8599373e5caef0bc42bdb95d2a361ab0d
which is after Andrew's tree was merged but before the 3.4-rc1 release.

Roughly 1K of text and over 200 lines of code are removed, and struct
scan_control is smaller.

text data bss dec hex filename
6723455 1931304 2260992 10915751 a68fa7 vmlinux-3.3.0-git
6722303 1931304 2260992 10914599 a68b27 vmlinux-3.3.0-lumpyremove-v1

There are behaviour changes caused by the series, with details in the
patches themselves. I ran some preliminary tests but coverage is shaky
due to time constraints. The kernels tested were:

3.2.0 Vanilla 3.2.0 kernel
3.3.0-git Commit e22057c which will be part of 3.4-rc1
3.3.0-lumpyremove These two patches

fs-mark running in a threaded configuration showed nothing useful.

postmark had interesting results. I know postmark is not very useful
as a mail server benchmark but it pushes page reclaim in a manner that
is useful from a testing perspective. Regressions in page reclaim can
show up as regressions in postmark when its working set size (WSS) is
larger than physical memory.

POSTMARK
3.2.0-vanilla 3.3.0-git lumpyremove-v1r3
Transactions per second: 16.00 ( 0.00%) 19.00 (18.75%) 19.00 (18.75%)
Data megabytes read per second: 18.62 ( 0.00%) 23.18 (24.49%) 22.56 (21.16%)
Data megabytes written per second: 35.49 ( 0.00%) 44.18 (24.49%) 42.99 (21.13%)
Files created alone per second: 26.00 ( 0.00%) 35.00 (34.62%) 34.00 (30.77%)
Files create/transact per second: 8.00 ( 0.00%) 9.00 (12.50%) 9.00 (12.50%)
Files deleted alone per second: 680.00 ( 0.00%) 6124.00 (800.59%) 2041.00 (200.15%)
Files delete/transact per second: 8.00 ( 0.00%) 9.00 (12.50%) 9.00 (12.50%)

MMTests Statistics: duration
Sys Time Running Test (seconds) 119.61 111.16 111.40
User+Sys Time Running Test (seconds) 153.19 144.13 143.29
Total Elapsed Time (seconds) 1171.34 940.97 966.97

MMTests Statistics: vmstat
Page Ins 13797412 13734736 13731792
Page Outs 43284036 42959856 42744668
Swap Ins 7751 0 0
Swap Outs 9617 0 0
Direct pages scanned 334395 0 0
Kswapd pages scanned 9664358 9933599 9929577
Kswapd pages reclaimed 9621893 9932913 9928893
Direct pages reclaimed 334395 0 0
Kswapd efficiency 99% 99% 99%
Kswapd velocity 8250.686 10556.765 10268.754
Direct efficiency 100% 100% 100%
Direct velocity 285.481 0.000 0.000
Percentage direct scans 3% 0% 0%
Page writes by reclaim 9619 0 0
Page writes file 2 0 0
Page writes anon 9617 0 0
Page reclaim immediate 7 0 0
Page rescued immediate 0 0 0
Slabs scanned 38912 38912 38912
Direct inode steals 0 0 0
Kswapd inode steals 154304 160972 158444
Kswapd skipped wait 0 0 0
THP fault alloc 4 4 4
THP collapse alloc 0 0 0
THP splits 3 0 0
THP fault fallback 0 0 0
THP collapse fail 0 0 0
Compaction stalls 1 0 0
Compaction success 1 0 0
Compaction failures 0 0 0
Compaction pages moved 0 0 0
Compaction move failure 0 0 0
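
For anyone unfamiliar with the derived figures, efficiency and velocity
are computed from the raw counters and the elapsed time. A minimal sketch
that reproduces the 3.2.0-vanilla kswapd figures (the real computation is
in MMTests' report scripts, not this standalone check):

  #include <stdio.h>

  int main(void)
  {
          long scanned = 9664358, reclaimed = 9621893;
          double elapsed = 1171.34;       /* Total Elapsed Time above */

          /* efficiency: percentage of scanned pages reclaimed */
          printf("Kswapd efficiency %ld%%\n", reclaimed * 100 / scanned);
          /* velocity: pages scanned per second of elapsed test time */
          printf("Kswapd velocity %.3f\n", (double)scanned / elapsed);
          return 0;
  }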

It looks like 3.3.0-git is better in general, although that "Files deleted
alone per second" figure looks like an anomaly. Fully removing lumpy
reclaim affects the results a little, but not enough to be of concern,
particularly as monitoring was running at the same time, which disturbs
the results. Dirty pages were not being encountered at the end of the
LRU, so the behaviour change related to THP allocations stalling on dirty
pages would not be triggered.
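
For context, the stall in question comes from synchronous lumpy reclaim
waiting for IO on pages found under writeback at the end of the LRU. A
paraphrased (not verbatim) sketch of the path in 3.3's shrink_page_list()
that patch 1 removes:

  /* Paraphrased from mm/vmscan.c as of 3.3; patch 1 removes this path */
  if (PageWriteback(page)) {
          if ((sc->reclaim_mode & RECLAIM_MODE_SYNC) && may_enter_fs)
                  wait_on_page_writeback(page);   /* the stall */
          else
                  goto keep_lumpy;
  }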

Note that swap in/out, direct reclaim and page writes from reclaim dropped
to 0 between 3.2.0 and 3.3.0-git. According to a range of results I have
for mainline kernels between 2.6.32 and 3.3.0 on a different machine, this
swap in/out and direct reclaim problem was introduced after 3.0 and fixed
by 3.3.0, with 3.1.x and 3.2.x both showing swap in/out, direct reclaim
and page writes from reclaim. If I had to guess, it was fixed by commits
e0887c19, fe4b1b24 and 0cee34fd but I did not double check[1].

Removing lumpy reclaim does not make an obvious difference, but note that
THP was barely used at all in this benchmark. Benchmarks that stress both
page reclaim and THP at the same time in a meaningful manner are thin on
the ground.

A benchmark that uses dd to write a large file also showed nothing
interesting, but I was not really expecting it to. The test looks for
problems related to a large linear writer, and removing lumpy reclaim was
unlikely to affect it.

I ran a benchmark that stresses high-order allocation. This is a very
artificial load, but it was used in the past to evaluate lumpy reclaim and
compaction. Generally I look at allocation success rates and latency figures.

STRESS-HIGHALLOC
3.2.0-vanilla 3.3.0-git lumpyremove-v1r3
Pass 1 82.00 ( 0.00%) 27.00 (-55.00%) 32.00 (-50.00%)
Pass 2 70.00 ( 0.00%) 37.00 (-33.00%) 40.00 (-30.00%)
while Rested 90.00 ( 0.00%) 88.00 (-2.00%) 88.00 (-2.00%)

MMTests Statistics: duration
Sys Time Running Test (seconds) 735.12 688.13 683.91
User+Sys Time Running Test (seconds) 2764.46 3278.45 3271.41
Total Elapsed Time (seconds) 1204.41 1140.29 1137.58

MMTests Statistics: vmstat
Page Ins 5426648 2840348 2695120
Page Outs 7206376 7854516 7860408
Swap Ins 36799 0 0
Swap Outs 76903 4 0
Direct pages scanned 31981 43749 160647
Kswapd pages scanned 26658682 1285341 1195956
Kswapd pages reclaimed 2248583 1271621 1178420
Direct pages reclaimed 6397 14416 94093
Kswapd efficiency 8% 98% 98%
Kswapd velocity 22134.225 1127.205 1051.316
Direct efficiency 20% 32% 58%
Direct velocity 26.553 38.367 141.218
Percentage direct scans 0% 3% 11%
Page writes by reclaim 6530481 4 0
Page writes file 6453578 0 0
Page writes anon 76903 4 0
Page reclaim immediate 256742 17832 61576
Page rescued immediate 0 0 0
Slabs scanned 1073152 971776 975872
Direct inode steals 0 196279 205178
Kswapd inode steals 139260 70390 64323
Kswapd skipped wait 21711 1 0
THP fault alloc 1 126 143
THP collapse alloc 324 294 224
THP splits 32 8 10
THP fault fallback 0 0 0
THP collapse fail 5 6 7
Compaction stalls 364 1312 1324
Compaction success 255 343 366
Compaction failures 109 969 958
Compaction pages moved 265107 3952630 4489215
Compaction move failure 7493 26038 24739

Success rates are completely hosed for 3.4-rc1, which is almost certainly
due to [fe2c2a10: vmscan: reclaim at order 0 when compaction is enabled]. I
expected this would happen for kswapd and impair allocation success rates
(https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much
of a difference: 95% less scanning and 43% less reclaim by kswapd.

In comparison, reclaim/compaction is not aggressive and gives up easily,
which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be
much more aggressive about reclaim/compaction than THP allocations are. The
stress test above allocates like neither THP nor hugetlbfs, but is much
closer to THP.
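
To make the distinction concrete, here is a minimal sketch of the two
allocation styles. The helpers are hypothetical and the exact gfp
combinations in the real hugetlb and THP paths differ; this only
illustrates __GFP_REPEAT versus __GFP_NORETRY:

  #include <linux/gfp.h>
  #include <linux/hugetlb.h>
  #include <linux/huge_mm.h>

  /* Hypothetical helpers for illustration; not the real allocation sites */
  static struct page *alloc_hugetlb_style(void)
  {
          /* __GFP_REPEAT: reclaim/compaction retries hard, as pool
           * resizing can afford to wait for it */
          return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
                             __GFP_REPEAT, HUGETLB_PAGE_ORDER);
  }

  static struct page *alloc_thp_style(void)
  {
          /* __GFP_NORETRY: give up easily so a THP fault falls back
           * to a base page instead of stalling */
          return alloc_pages(GFP_HIGHUSER_MOVABLE | __GFP_COMP |
                             __GFP_NORETRY, HPAGE_PMD_ORDER);
  }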

Mainline is now impaired in terms of high-order allocation under heavy
load, although I do not know to what degree as I did not test with
__GFP_REPEAT.
Still, keep it in mind for bugs related to hugepage pool resizing, THP
allocation and high order atomic allocation failures from network devices.

Despite this, I think we should merge the patches in this series. The
stress tests were very useful when the main user was hugetlb pool resizing
and when rattling out bugs in memory compaction but are now too unrealistic
to draw solid conclusions from. They need to be replaced but that should
not delay the lumpy reclaim removal.

I'd appreciate it if people took a look at the patches and saw whether
there is anything I missed.

[1] Where are these results, you say? They are being generated using
MMTests to see what negative trends can be identified. The runs are still
in progress and I've had limited time to dig through the data.

include/trace/events/vmscan.h | 40 ++-----
mm/vmscan.c | 263 ++++-------------------------------------
2 files changed, 37 insertions(+), 266 deletions(-)

--
1.7.9.2
