Re: [RFC v5 0/7] Implement Data Access Monitoring-based Memory Operation Schemes

From: Jonathan Cameron
Date: Tue Mar 31 2020 - 11:52:10 EST


On Mon, 30 Mar 2020 13:50:35 +0200
SeongJae Park <sjpark@xxxxxxxxxx> wrote:

> From: SeongJae Park <sjpark@xxxxxxxxx>
>
> DAMON[1] can be used as a primitive for data access-aware memory management
> optimizations. That said, users who want such optimizations need to run DAMON,
> read the monitoring results, analyze them, plan a new memory management
> scheme, and apply the new scheme by themselves. Such efforts will be
> inevitable for some complicated optimizations.
>
> However, in many other cases, users might simply want the system to apply a
> memory management action to a memory region of a specific size having a
> specific access frequency for a specific time. For example, "page out a memory
> region larger than 100 MiB that has shown only rare accesses for more than 2
> minutes", or "do not use THP for a memory region larger than 2 MiB that has
> been rarely accessed for more than 1 second".
>
> This RFC patchset makes DAMON handle such data access monitoring-based
> operation schemes. With this change, users can apply such data access-aware
> optimizations by simply specifying their schemes to DAMON.


Hi SeongJae,

I'm wondering if I'm misreading the results below or whether a data handling
mixup occurred. See inline.

Thanks,

Jonathan

>
>
> Evaluations
> ===========
>
> Setup
> -----
>
> On my personal QEMU/KVM-based virtual machine on an Intel i7 host machine
> running Ubuntu 18.04, I measure runtime and consumed system memory while
> running various realistic workloads with several configurations. I use 13 and
> 12 workloads from the PARSEC3[3] and SPLASH-2X[4] benchmark suites,
> respectively. I use my own wrapper scripts[5] to set up and run the workloads.
> For this evaluation, the DAMON-based operation schemes patchset[6] is also
> applied on top.
>
> Measurement
> ~~~~~~~~~~~
>
> For the measurement of the amount of consumed memory in the system-global
> scope, I drop caches before starting each of the workloads and monitor
> 'MemFree' in the '/proc/meminfo' file. To make the results more stable, I
> repeat the runs 5 times and average the results. You can find the stdev, min,
> and max of the numbers among the repeated runs in the appendix below.
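> 
> For reference, a minimal sketch of the sampling logic is as below. The helper
> names and the exact accounting are illustrative only; the real wrapper
> scripts[5] may differ in details:
> 
>     import time
> 
>     def meminfo_kb(field):
>         with open('/proc/meminfo') as f:
>             for line in f:
>                 if line.startswith(field + ':'):
>                     return int(line.split()[1])  # /proc/meminfo reports kB
>         raise KeyError(field)
> 
>     def avg_memused(workload_is_running, interval=0.5):
>         total = meminfo_kb('MemTotal')
>         samples = []
>         while workload_is_running():
>             samples.append(total - meminfo_kb('MemFree'))
>             time.sleep(interval)
>         return sum(samples) / len(samples)  # average consumed memory, in KiB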
>
> Configurations
> ~~~~~~~~~~~~~~
>
> The configurations I use are as below.
>
> orig: Linux v5.5 with 'madvise' THP policy
> rec: 'orig' plus DAMON running with record feature
> thp: same as 'orig', but using the 'always' THP policy
> ethp: 'orig' plus a DAMON operation scheme[6], 'efficient THP'
> prcl: 'orig' plus a DAMON operation scheme, 'proactive reclaim[7]'
>
> I use 'rec' to measure DAMON's overhead on the target workloads and system
> memory. The remaining configs, 'thp', 'ethp', and 'prcl', are for measuring
> DAMON's monitoring accuracy.
>
> 'ethp' and 'prcl' are simple DAMON-based operation schemes developed as proofs
> of concept for DAMON. 'ethp' reduces the memory space waste of THP by using
> DAMON for huge page promotion and demotion decisions, while 'prcl' is similar
> to the original work[7]. They are implemented as below:
>
> # format: <min/max size> <min/max frequency (0-100)> <min/max age> <action>
> # ethp: Use huge pages if a region >2MB shows >5% access rate, use regular
> # pages if a region >2MB shows <5% access rate for >1 second
> 2M null 5 null null null hugepage
> 2M null null 5 1s null nohugepage
>
> # prcl: If a region >4KB shows <5% access rate for >5 seconds, page out.
> 4K null null 5 5s null pageout
>
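> As another illustration, the prose examples from the beginning of this letter
> could be expressed in the same format roughly as below (the 10% access rate
> threshold is my arbitrary pick for "rarely accessed" and is illustrative
> only):
> 
> # page out a region >100MB that showed <10% access rate for >2 minutes
> 100M null null 10 120s null pageout
> # avoid THP for a region >2MB that showed <10% access rate for >1 second
> 2M null null 10 1s null nohugepage
> 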
> Note that both 'ethp' and 'prcl' are designed only with my straightforward
> intuition, because they are meant only as proofs of concept and as checks of
> DAMON's monitoring accuracy. In other words, they are not for production use;
> for that, they would need further tuning.
>
>
> [1] "Redis latency problems troubleshooting", https://redis.io/topics/latency
> [2] "Disable Transparent Huge Pages (THP)",
> https://docs.mongodb.com/manual/tutorial/transparent-huge-pages/
> [3] "The PARSEC Becnhmark Suite", https://parsec.cs.princeton.edu/index.htm
> [4] "SPLASH-2x", https://parsec.cs.princeton.edu/parsec3-doc.htm#splash2x
> [5] "parsec3_on_ubuntu", https://github.com/sjp38/parsec3_on_ubuntu
> [6] "[RFC v4 0/7] Implement Data Access Monitoring-based Memory Operation
> Schemes",
> https://lore.kernel.org/linux-mm/20200303121406.20954-1-sjpark@xxxxxxxxxx/
> [7] "Proactively reclaiming idle memory", https://lwn.net/Articles/787611/
>
>
> Results
> -------
>
> The two tables below show the measurement results. The runtimes are in
> seconds, while the memory usages are in KiB. Each configuration except 'orig'
> shows its overhead relative to 'orig', in percent, within parentheses.
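> 
> The percentages are computed as below (a trivial helper, shown only to make
> the tables unambiguous):
> 
>     # overhead (%) of a configuration relative to 'orig'
>     def overhead_pct(orig, val):
>         return (val - orig) / orig * 100
> 
>     # e.g., splash2x/ocean_ncp runtime under 'thp':
>     # overhead_pct(86.927, 50.747) is about -41.62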
>
> runtime orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> parsec3/blackscholes 107.594 107.956 (0.34) 106.750 (-0.78) 107.672 (0.07) 111.916 (4.02)
> parsec3/bodytrack 79.230 79.368 (0.17) 78.908 (-0.41) 79.705 (0.60) 80.423 (1.50)
> parsec3/canneal 142.831 143.810 (0.69) 123.530 (-13.51) 133.778 (-6.34) 144.998 (1.52)
> parsec3/dedup 11.986 11.959 (-0.23) 11.762 (-1.87) 12.028 (0.35) 13.313 (11.07)
> parsec3/facesim 210.125 209.007 (-0.53) 205.226 (-2.33) 207.766 (-1.12) 209.815 (-0.15)
> parsec3/ferret 191.601 191.177 (-0.22) 190.420 (-0.62) 191.775 (0.09) 192.638 (0.54)
> parsec3/fluidanimate 212.735 212.970 (0.11) 209.151 (-1.68) 211.904 (-0.39) 218.573 (2.74)
> parsec3/freqmine 291.225 290.873 (-0.12) 289.258 (-0.68) 289.884 (-0.46) 298.373 (2.45)
> parsec3/raytrace 118.289 119.586 (1.10) 119.045 (0.64) 119.064 (0.66) 137.919 (16.60)
> parsec3/streamcluster 323.565 328.168 (1.42) 279.565 (-13.60) 287.452 (-11.16) 333.244 (2.99)
> parsec3/swaptions 155.140 155.473 (0.21) 153.816 (-0.85) 156.423 (0.83) 156.237 (0.71)
> parsec3/vips 58.979 59.311 (0.56) 58.733 (-0.42) 59.005 (0.04) 61.062 (3.53)
> parsec3/x264 70.539 68.413 (-3.01) 64.760 (-8.19) 67.180 (-4.76) 68.103 (-3.45)
> splash2x/barnes 80.414 81.751 (1.66) 73.585 (-8.49) 80.232 (-0.23) 115.753 (43.95)
> splash2x/fft 33.902 34.111 (0.62) 24.228 (-28.53) 29.926 (-11.73) 44.438 (31.08)
> splash2x/lu_cb 85.556 86.001 (0.52) 84.538 (-1.19) 86.000 (0.52) 91.447 (6.89)
> splash2x/lu_ncb 93.399 93.652 (0.27) 90.463 (-3.14) 94.008 (0.65) 93.901 (0.54)
> splash2x/ocean_cp 45.253 45.191 (-0.14) 43.049 (-4.87) 44.022 (-2.72) 46.588 (2.95)
> splash2x/ocean_ncp 86.927 87.065 (0.16) 50.747 (-41.62) 86.855 (-0.08) 199.553 (129.57)
> splash2x/radiosity 91.433 91.511 (0.09) 90.626 (-0.88) 91.865 (0.47) 104.524 (14.32)
> splash2x/radix 31.923 32.023 (0.31) 25.194 (-21.08) 32.035 (0.35) 39.231 (22.89)
> splash2x/raytrace 84.367 84.677 (0.37) 82.417 (-2.31) 83.505 (-1.02) 84.857 (0.58)
> splash2x/volrend 87.499 87.495 (-0.00) 86.775 (-0.83) 87.311 (-0.21) 87.511 (0.01)
> splash2x/water_nsquared 236.397 236.759 (0.15) 219.902 (-6.98) 224.228 (-5.15) 238.562 (0.92)
> splash2x/water_spatial 89.646 89.767 (0.14) 89.735 (0.10) 90.347 (0.78) 103.585 (15.55)
> total 3020.570 3028.080 (0.25) 2852.190 (-5.57) 2953.960 (-2.21) 3276.550 (8.47)
>
>
> memused.avg orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> parsec3/blackscholes 1785916.600 1834201.400 (2.70) 1826249.200 (2.26) 1828079.200 (2.36) 1712210.600 (-4.13)
> parsec3/bodytrack 1415049.400 1434317.600 (1.36) 1423715.000 (0.61) 1430392.600 (1.08) 1435136.000 (1.42)
> parsec3/canneal 1043489.800 1058617.600 (1.45) 1040484.600 (-0.29) 1048664.800 (0.50) 1050280.000 (0.65)
> parsec3/dedup 2414453.200 2458493.200 (1.82) 2411379.400 (-0.13) 2400516.000 (-0.58) 2461120.800 (1.93)
> parsec3/facesim 541597.200 550097.400 (1.57) 544364.600 (0.51) 553240.000 (2.15) 552316.400 (1.98)
> parsec3/ferret 317986.600 332346.000 (4.52) 320218.000 (0.70) 331085.000 (4.12) 330895.200 (4.06)
> parsec3/fluidanimate 576183.400 585442.000 (1.61) 577780.200 (0.28) 587703.400 (2.00) 506501.000 (-12.09)
> parsec3/freqmine 990869.200 997817.000 (0.70) 990350.400 (-0.05) 997669.000 (0.69) 763325.800 (-22.96)
> parsec3/raytrace 1748370.800 1757109.200 (0.50) 1746153.800 (-0.13) 1757830.400 (0.54) 1581455.800 (-9.55)
> parsec3/streamcluster 121521.800 140452.400 (15.58) 129725.400 (6.75) 132266.000 (8.84) 130558.200 (7.44)
> parsec3/swaptions 15592.400 29018.800 (86.11) 14765.800 (-5.30) 27260.200 (74.83) 26631.600 (70.80)
> parsec3/vips 2957567.600 2967993.800 (0.35) 2956623.200 (-0.03) 2973062.600 (0.52) 2951402.000 (-0.21)
> parsec3/x264 3169012.400 3175048.800 (0.19) 3190345.400 (0.67) 3189353.000 (0.64) 3172924.200 (0.12)
> splash2x/barnes 1209066.000 1213125.400 (0.34) 1217261.400 (0.68) 1209661.600 (0.05) 921041.800 (-23.82)
> splash2x/fft 9359313.200 9195213.000 (-1.75) 9377562.400 (0.19) 9050957.600 (-3.29) 9517977.000 (1.70)
> splash2x/lu_cb 514966.200 522939.400 (1.55) 520870.400 (1.15) 522635.000 (1.49) 329933.600 (-35.93)
> splash2x/lu_ncb 514180.400 525974.800 (2.29) 521420.200 (1.41) 521063.600 (1.34) 523557.000 (1.82)
> splash2x/ocean_cp 3346493.400 3288078.000 (-1.75) 3382253.800 (1.07) 3289477.600 (-1.70) 3260810.400 (-2.56)
> splash2x/ocean_ncp 3909966.400 3882968.800 (-0.69) 7037196.000 (79.98) 4046363.400 (3.49) 3471452.400 (-11.22)
> splash2x/radiosity 1471119.400 1470626.800 (-0.03) 1482604.200 (0.78) 1472718.400 (0.11) 546893.600 (-62.82)
> splash2x/radix 1748360.800 1729163.400 (-1.10) 1371463.200 (-21.56) 1701993.600 (-2.65) 1817519.600 (3.96)
> splash2x/raytrace 46670.000 60172.200 (28.93) 51901.600 (11.21) 60782.600 (30.24) 52644.800 (12.80)
> splash2x/volrend 150666.600 167444.200 (11.14) 151335.200 (0.44) 163345.000 (8.41) 162760.000 (8.03)
> splash2x/water_nsquared 45720.200 59422.400 (29.97) 46031.000 (0.68) 61801.400 (35.17) 62627.000 (36.98)
> splash2x/water_spatial 663052.200 672855.800 (1.48) 665787.600 (0.41) 674696.200 (1.76) 471052.600 (-28.96)
> total 40077300.000 40108900.000 (0.08) 42997900.000 (7.29) 40032700.000 (-0.11) 37813000.000 (-5.65)
>
>
> DAMON Overheads
> ~~~~~~~~~~~~~~~
>
> In total, the DAMON recording feature incurs 0.25% runtime overhead (up to
> 1.66% in the worst case, with 'splash2x/barnes') and 0.08% memory space
> overhead.
>
> For convenient test runs of 'rec', I use a Python wrapper. The wrapper
> constantly consumes about 10-15MB of memory. This becomes a high relative
> memory overhead if the target workload has a small memory footprint. In
> detail, 16%, 86%, 29%, 11%, and 30% overheads are shown for
> parsec3/streamcluster (125 MiB), parsec3/swaptions (15 MiB),
> splash2x/raytrace (45 MiB), splash2x/volrend (151 MiB), and
> splash2x/water_nsquared (46 MiB). Nonetheless, the overheads come not from
> DAMON but from the wrapper, and thus should be ignored. This fake memory
> overhead continues in 'ethp' and 'prcl', as those configurations also use the
> Python wrapper.
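> 
> As a quick sanity check, the 'rec' minus 'orig' memused gap for
> parsec3/swaptions matches the wrapper's own size (a back-of-envelope
> calculation from the table above):
> 
>     gap_kib = 29018.8 - 15592.4  # 'rec' minus 'orig' from the memused table
>     print(gap_kib / 1024)        # ~13.1 MiB, within the wrapper's 10-15MB range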
>
>
> Efficient THP
> ~~~~~~~~~~~~~
>
> The THP 'always' policy achieves a 5.57% speedup but incurs 7.29% memory
> overhead. It achieves a 41.62% speedup in the best case, but 79.98% memory
> overhead in the worst case. Interestingly, both the best and worst cases come
> from 'splash2x/ocean_ncp'.

The results above don't seem to support this any more?

> runtime orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> splash2x/ocean_ncp 86.927 87.065 (0.16) 50.747 (-41.62) 86.855 (-0.08) 199.553 (129.57)




>
> The 2-line implementation of the data access monitoring-based THP version
> ('ethp') shows a 2.21% speedup and a -0.11% memory overhead. In other words,
> 'ethp' removes 100% of THP's memory waste while preserving 39.67% of THP's
> speedup in total.
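> 
> The two figures are derived from the totals above as below (shown only to
> make the derivation explicit):
> 
>     thp_speedup, ethp_speedup = 5.57, 2.21   # total runtime reduction vs 'orig', %
>     thp_memover, ethp_memover = 7.29, -0.11  # total memused overhead vs 'orig', %
> 
>     print(ethp_speedup / thp_speedup * 100)      # ~39.7% of the THP speedup kept
>     print((thp_memover - max(ethp_memover, 0))
>           / thp_memover * 100)                   # 100% of the THP waste removed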
>
>
> Proactive Reclamation
> ~~~~~~~~~~~~~~~~~~~~
>
> As in the original work, I use a 'zram' swap device for this configuration.
> 
> Our 1-line implementation of proactive reclamation, 'prcl', incurred 8.47%
> runtime overhead in total while achieving a 5.65% reduction of system memory
> usage.
> 
> Nonetheless, as the memory usage is calculated from 'MemFree' in
> '/proc/meminfo', it includes SwapCached pages. As the swap-cached pages can be
> easily evicted, I also measured the resident set size (RSS) of the workloads:
>
> rss.avg orig rec (overhead) thp (overhead) ethp (overhead) prcl (overhead)
> parsec3/blackscholes 592502.000 589764.400 (-0.46) 592132.600 (-0.06) 593702.000 (0.20) 406639.400 (-31.37)
> parsec3/bodytrack 32365.400 32195.000 (-0.53) 32210.800 (-0.48) 32114.600 (-0.77) 21537.600 (-33.45)
> parsec3/canneal 839904.200 840292.200 (0.05) 836866.400 (-0.36) 838263.200 (-0.20) 837895.800 (-0.24)
> parsec3/dedup 1208337.200 1218465.600 (0.84) 1233278.600 (2.06) 1200490.200 (-0.65) 882911.400 (-26.93)
> parsec3/facesim 311380.800 311363.600 (-0.01) 315642.600 (1.37) 312573.400 (0.38) 310257.400 (-0.36)
> parsec3/ferret 99514.800 99542.000 (0.03) 100454.200 (0.94) 99879.800 (0.37) 89679.200 (-9.88)
> parsec3/fluidanimate 531760.800 531735.200 (-0.00) 531865.400 (0.02) 531940.800 (0.03) 440781.000 (-17.11)
> parsec3/freqmine 552455.400 552882.600 (0.08) 555793.600 (0.60) 553019.800 (0.10) 58067.000 (-89.49)
> parsec3/raytrace 894798.400 894953.400 (0.02) 892223.400 (-0.29) 893012.400 (-0.20) 315259.800 (-64.77)
> parsec3/streamcluster 110780.400 110856.800 (0.07) 110954.000 (0.16) 111310.800 (0.48) 108066.800 (-2.45)
> parsec3/swaptions 5614.600 5645.600 (0.55) 5553.200 (-1.09) 5552.600 (-1.10) 3251.800 (-42.08)
> parsec3/vips 31942.200 31752.800 (-0.59) 32042.600 (0.31) 32226.600 (0.89) 29012.200 (-9.17)
> parsec3/x264 81770.800 81609.200 (-0.20) 82800.800 (1.26) 82612.200 (1.03) 81805.800 (0.04)
> splash2x/barnes 1216515.600 1217113.800 (0.05) 1225605.600 (0.75) 1217325.000 (0.07) 540108.400 (-55.60)
> splash2x/fft 9668660.600 9751350.800 (0.86) 9773806.400 (1.09) 9613555.400 (-0.57) 7951241.800 (-17.76)
> splash2x/lu_cb 510368.800 510095.800 (-0.05) 514350.600 (0.78) 510276.000 (-0.02) 311584.800 (-38.95)
> splash2x/lu_ncb 509904.800 510001.600 (0.02) 513847.000 (0.77) 510073.400 (0.03) 509905.600 (0.00)
> splash2x/ocean_cp 3389550.600 3404466.000 (0.44) 3443363.600 (1.59) 3410388.000 (0.61) 3330608.600 (-1.74)
> splash2x/ocean_ncp 3923723.200 3911148.200 (-0.32) 7175800.400 (82.88) 4104482.400 (4.61) 2030525.000 (-48.25)
> splash2x/radiosity 1472994.600 1475946.400 (0.20) 1485636.800 (0.86) 1476193.000 (0.22) 262161.400 (-82.20)
> splash2x/radix 1750329.800 1765697.000 (0.88) 1413304.000 (-19.25) 1754154.400 (0.22) 1516142.600 (-13.38)
> splash2x/raytrace 23149.600 23208.000 (0.25) 28574.400 (23.43) 26694.600 (15.31) 16257.800 (-29.77)
> splash2x/volrend 43968.800 43919.000 (-0.11) 44087.600 (0.27) 44224.000 (0.58) 32484.400 (-26.12)
> splash2x/water_nsquared 29348.000 29338.400 (-0.03) 29604.600 (0.87) 29779.400 (1.47) 23644.800 (-19.43)
> splash2x/water_spatial 655263.600 655097.800 (-0.03) 655199.200 (-0.01) 656282.400 (0.16) 379816.800 (-42.04)
> total 28486900.000 28598400.000 (0.39) 31625000.000 (11.02) 28640100.000 (0.54) 20489600.000 (-28.07)
>
> In total, the RSS was reduced by 28.07%.
>
> With parsec3/freqmine, 'prcl' reduced the system memory usage by 22.96% and
> the RSS by 89.49% while incurring only 2.45% runtime overhead.
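> 
> For reference, one simple way to sample the RSS is polling 'VmRSS' in procfs
> as below; this is a minimal sketch with a hypothetical helper name, and the
> real collection scripts[5] may differ:
> 
>     # sample the resident set size of a workload process, in kB
>     def rss_kb(pid):
>         with open('/proc/%d/status' % pid) as f:
>             for line in f:
>                 if line.startswith('VmRSS:'):
>                     return int(line.split()[1])
>         return 0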
>
>
> Sequence Of Patches
> ===================
>
> The patches are based on v5.6 plus the v7 DAMON patchset[1] and Minchan's
> ``do_madvise()`` patch[2]. Minchan's patch was necessary to reuse the
> ``madvise()`` code in DAMON. You can also clone the complete git tree:
>
> $ git clone git://github.com/sjp38/linux -b damos/rfc/v5
>
> A web view is also available:
> https://github.com/sjp38/linux/releases/tag/damos/rfc/v5
>
>
> [1] https://lore.kernel.org/linux-mm/20200318112722.30143-1-sjpark@xxxxxxxxxx/
> [2] https://lore.kernel.org/linux-mm/20200302193630.68771-2-minchan@xxxxxxxxxx/
>
> The first patch allows DAMON to reuse the ``madvise()`` code for the actions.
> The second patch accounts the age of each region. The third patch implements
> the handling of the schemes in DAMON and exports a kernel space programming
> interface for it. The fourth patch implements a debugfs interface for
> privileged users and programs. The fifth and sixth patches add kunit tests
> and selftests for these changes, respectively, and finally the seventh patch
> modifies the user space tool for DAMON to support describing and applying
> schemes in a human-friendly way.
>
>
> Patch History
> =============
>
> Changes from RFC v4
> (https://lore.kernel.org/linux-mm/20200303121406.20954-1-sjpark@xxxxxxxxxx/)
> - Handle CONFIG_ADVISE_SYSCALL
> - Clean up code (Jonathan Cameron)
> - Update test results
> - Rebase on v5.6 + DAMON v7
>
> Changes from RFC v3
> (https://lore.kernel.org/linux-mm/20200225102300.23895-1-sjpark@xxxxxxxxxx/)
> - Add Reviewed-by from Brendan Higgins
> - Code cleanup: Modularize madvise() call
> - Fix a trivial bug in the wrapper python script
> - Add more stable and detailed evaluation results with updated ETHP scheme
>
> Changes from RFC v2
> (https://lore.kernel.org/linux-mm/20200218085309.18346-1-sjpark@xxxxxxxxxx/)
> - Fix the aging mechanism for better 'old region' selection
> - Add more kunittests and kselftests for this patchset
> - Support more human-friendly description and application of 'schemes'
>
> Changes from RFC v1
> (https://lore.kernel.org/linux-mm/20200210150921.32482-1-sjpark@xxxxxxxxxx/)
> - Properly adjust age accounting-related properties after splitting, merging,
> and applying actions
> SeongJae Park (7):
> mm/madvise: Export do_madvise() to external GPL modules
> mm/damon: Account age of target regions
> mm/damon: Implement data access monitoring-based operation schemes
> mm/damon/schemes: Implement a debugfs interface
> mm/damon-test: Add kunit test case for regions age accounting
> mm/damon/selftests: Add 'schemes' debugfs tests
> damon/tools: Support more human friendly 'schemes' control
>
> include/linux/damon.h | 29 ++
> mm/damon-test.h | 5 +
> mm/damon.c | 428 +++++++++++++++++-
> mm/madvise.c | 1 +
> tools/damon/_convert_damos.py | 125 +++++
> tools/damon/_damon.py | 143 ++++++
> tools/damon/damo | 7 +
> tools/damon/record.py | 135 +-----
> tools/damon/schemes.py | 105 +++++
> .../testing/selftests/damon/debugfs_attrs.sh | 29 ++
> 10 files changed, 878 insertions(+), 129 deletions(-)
> create mode 100755 tools/damon/_convert_damos.py
> create mode 100644 tools/damon/_damon.py
> create mode 100644 tools/damon/schemes.py
>