Re: [RFC PATCH 0/5] Introduce Data Access MONitor (DAMON)

From: SeongJae Park
Date: Mon Jan 13 2020 - 03:57:04 EST


Adding more recipients for comments. The original RFC mail is available at:
https://lore.kernel.org/linux-mm/20200110131522.29964-1-sjpark@xxxxxxxxxx/


Thanks,
SeongJae Park

On Fri, 10 Jan 2020 14:15:17 +0100 SeongJae Park <sjpark@xxxxxxxxxx> wrote:

> From: SeongJae Park <sjpark@xxxxxxxxx>
>
> This RFC patchset introduces a new kernel module for practical monitoring of
> data accesses, namely DAMON.
>
> The patches are organized in the following sequence. The first and second
> patch introduces the core logic and the raw level user interface of DAMON,
> respectively. To provide a minimal reference to the raw level interfaces and
> for more convenient test of the DAMON itself, the third patch implements an
> user space wrapper tools for the DAMON. The fourth patch adds a document for
> the DAMON, and finally the fifth patch provides DAMON's unit tests, which is
> using the kunit framework.
>
> The patches are based on the v5.4 plus the back-ported kunit, which retrieved
> from v5.5-rc1. You can also clone the complete git tree by:
>
> $ git clone git://github.com/sjp38/linux -b damon/rfc/v1
>
> The web is also available:
> https://github.com/sjp38/linux/releases/tag/damon/rfc/v1
>
> ----
>
> DAMON is a kernel module that allows users to monitor the actual memory access
> pattern of specific user-space processes. It aims to be 1) accurate enough to
> be useful for performance-centric domains, and 2) sufficiently light-weight so
> that it can be applied online.
>
> For the goals, DAMON utilizes its two core mechanisms, called region-based
> sampling and adaptive regions adjustment. The region-based sampling allows
> users to make their own trade-off between the quality and the overhead of the
> monitoring and set the upperbound of the monitoring overhead. Further, the
> adaptive regions adjustment mechanism makes DAMON to maximize the quality and
> minimize the overhead with its best efforts while preserving the users
> configured trade-off.
>
>
> Background
> ==========
>
> For performance-centric analysis and optimizations of memory management schemes
> (either that of kernel space or user space), the actual data access pattern of
> the workloads is highly useful. The information need to be only reasonable
> rather than strictly correct, because some level of incorrectness can be
> handled in many performance-centric domains. It also need to be taken within
> reasonably short time with only light-weight overhead.
>
> Manually extracting such data is not easy and time consuming if the target
> workload is huge and complex, even for the developers of the programs. There
> are a range of tools and techniques developed for general memory access
> investigations, and some of those could be partially used for this purpose.
> However, most of those are not practical or unscalable, mainly because those
> are designed with no consideration about the trade-off between the accuracy of
> the output and the overhead.
>
> The memory access instrumentation techniques which is applied to many tools
> such as Intel PIN is essential for correctness required cases such as invalid
> memory access bug detections. However, those usually incur high overhead which
> is unacceptable for many of the performance-centric domains. Periodic access
> checks based on H/W or S/W access counting features (e.g., the Accessed bits of
> PTEs or the PG_Idle flags of pages) can dramatically decrease the overhead by
> forgiving some of the quality, compared to the instrumentation based
> techniques. The reduced quality is still reasonable for many of the domains,
> but the overhead can arbitrarily increase as the size of the target workload
> grows. Miniature-like static region based sampling can set the upperbound of
> the overhead, but it will now decrease the quality of the output as the size of
> the workload grows.
>
>
> Related Works
> =============
>
> There are a number of researches[1,2,3,4,5,6] optimizing memory management
> mechanisms based on the actual memory access patterns that shows impressive
> results. However, most of those has no deep consideration about the monitoring
> of the accesses itself. Some of those focused on the overhead of the
> monitoring, but does not consider the accuracy scalability[6] or has additional
> dependencies[7]. Indeed, one recent research[5] about the proactive
> reclamation has also proposed[8] to the kernel community but the monitoring
> overhead was considered a main problem.
>
> [1] Subramanya R Dulloor, Amitabha Roy, Zheguang Zhao, Narayanan Sundaram,
> Nadathur Satish, Rajesh Sankaran, Jeff Jackson, and Karsten Schwan. 2016.
> Data tiering in heterogeneous memory systems. In Proceedings of the 11th
> European Conference on Computer Systems (EuroSys). ACM, 15.
> [2] Youngjin Kwon, Hangchen Yu, Simon Peter, Christopher J Rossbach, and Emmett
> Witchel. 2016. Coordinated and efficient huge page management with ingens.
> In 12th USENIX Symposium on Operating Systems Design and Implementation
> (OSDI). 705â721.
> [3] Harald Servat, Antonio J PeÃa, GermÃn Llort, Estanislao Mercadal,
> HansChristian Hoppe, and JesÃs Labarta. 2017. Automating the application
> data placement in hybrid memory systems. In 2017 IEEE International
> Conference on Cluster Computing (CLUSTER). IEEE, 126â136.
> [4] Vlad Nitu, Boris Teabe, Alain Tchana, Canturk Isci, and Daniel Hagimont.
> 2018. Welcome to zombieland: practical and energy-efficient memory
> disaggregation in a datacenter. In Proceedings of the 13th European
> Conference on Computer Systems (EuroSys). ACM, 16.
> [5] Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw
> Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid
> Shahid, Greg Thelen, Kamil Adam Yurtsever, Yu Zhao, and Parthasarathy
> Ranganathan. 2019. Software-Defined Far Memory in Warehouse-Scale
> Computers. In Proceedings of the 24th International Conference on
> Architectural Support for Programming Languages and Operating Systems
> (ASPLOS). ACM, New York, NY, USA, 317â330.
> DOI:https://doi.org/10.1145/3297858.3304053
> [6] Carl Waldspurger, Trausti Saemundsson, Irfan Ahmad, and Nohhyun Park.
> 2017. Cache Modeling and Optimization using Miniature Simulations. In 2017
> USENIX Annual Technical Conference (ATC). USENIX Association, Santa
> Clara, CA, 487â498.
> https://www.usenix.org/conference/atc17/technical-sessions/
> [7] Haojie Wang, Jidong Zhai, Xiongchao Tang, Bowen Yu, Xiaosong Ma, and
> Wenguang Chen. 2018. Spindle: Informed Memory Access Monitoring. In 2018
> USENIX Annual Technical Conference (ATC). USENIX Association, Boston, MA,
> 561â574. https://www.usenix.org/conference/atc18/presentation/wang-haojie
> [8] Jonathan Corbet. 2019. Proactively reclaiming idle memory. (2019).
> https://lwn.net/Articles/787611/.
>
>
> Expected Use-cases
> ==================
>
> A straightforward usecase of DAMON would be the program behavior analysis.
> With the DAMON output, users can confirm whether the program is running as
> intended or not. This will be useful for debuggings and tests of design
> points.
>
> The monitored results can also be useful for counting the dynamic working set
> size of workloads. For the administration of memory overcommitted systems or
> selection of the environments (e.g., containers providing different amount of
> memory) for your workloads, this will be useful.
>
> If you are a programmer, you can optimize your program by managing the memory
> based on the actual data access pattern. For example, you can identify the
> dynamic hotness of your data using DAMON and call ``mlock()`` to keep your hot
> data in DRAM, or call ``madvise()`` with ``MADV_PAGEOUT`` to proactively
> reclaim cold data. Even though your program is guaranteed to not encounter
> memory pressure, you can still improve the performance by applying the DAMON
> outputs for call of ``MADV_HUGEPAGE`` and ``MADV_NOHUGEPAGE``. More creative
> optimizations would be possible. Our evaluations of DAMON includes a
> straightforward optimization using the ``mlock()``. Please refer to the below
> Evaluation section for more detail.
>
> As DAMON incurs very low overhead, such optimizations can be applied not only
> offline, but also online. Also, there is no reason to limit such optimizations
> to the user space. Several parts of the kernel's memory management mechanisms
> could be also optimized using DAMON. The reclamation, the THP (de)promotion
> decisions, and the compaction would be such a candidates. Nevertheless,
> current version of DAMON is not highly optimized for the online/in-kernel uses.
>
>
> Mechanisms of DAMON
> ===================
>
>
> Basic Access Check
> ------------------
>
> DAMON basically reports what pages are how frequently accessed. The report is
> passed to users in binary format via a ``result file`` which users can set it's
> path. Note that the frequency is not an absolute number of accesses, but a
> relative frequency among the pages of the target workloads.
>
> Users can also control the resolution of the reports by setting two time
> intervals, ``sampling interval`` and ``aggregation interval``. In detail,
> DAMON checks access to each page per ``sampling interval``, aggregates the
> results (counts the number of the accesses to each page), and reports the
> aggregated results per ``aggregation interval``. For the access check of each
> page, DAMON uses the Accessed bits of PTEs.
>
> This is thus similar to the previously mentioned periodic access checks based
> mechanisms, which overhead is increasing as the size of the target process
> grows.
>
>
> Region Based Sampling
> ---------------------
>
> To avoid the unbounded increase of the overhead, DAMON groups a number of
> adjacent pages that assumed to have same access frequencies into a region. As
> long as the assumption (pages in a region have same access frequencies) is
> kept, only one page in the region is required to be checked. Thus, for each
> ``sampling interval``, DAMON randomly picks one page in each region and clears
> its Accessed bit. After one more ``sampling interval``, DAMON reads the
> Accessed bit of the page and increases the access frequency of the region if
> the bit has set meanwhile. Therefore, the monitoring overhead is controllable
> by setting the number of regions. DAMON allows users to set the minimal and
> maximum number of regions for the trade-off.
>
> Except the assumption, this is almost same with the above-mentioned
> miniature-like static region based sampling. In other words, this scheme
> cannot preserve the quality of the output if the assumption is not guaranteed.
>
>
> Adaptive Regions Adjustment
> ---------------------------
>
> At the beginning of the monitoring, DAMON constructs the initial regions by
> evenly splitting the memory mapped address space of the process into the
> user-specified minimal number of regions. In this initial state, the
> assumption is normally not kept and thus the quality could be low. To keep the
> assumption as much as possible, DAMON adaptively merges and splits each region.
> For each ``aggregation interval``, it compares the access frequencies of
> adjacent regions and merges those if the frequency difference is small. Then,
> after it reports and clears the aggregated access frequency of each region, it
> splits each region into two regions if the total number of regions is smaller
> than the half of the user-specified maximum number of regions.
>
> In this way, DAMON provides its best-effort quality and minimal overhead while
> keeping the bounds users set for their trade-off.
>
>
> Applying Dynamic Memory Mappings
> --------------------------------
>
> Only a number of small parts in the super-huge virtual address space of the
> processes is mapped to physical memory and accessed. Thus, tracking the
> unmapped address regions is just wasteful. However, tracking every memory
> mapping change might incur an overhead. For the reason, DAMON applies the
> dynamic memory mapping changes to the tracking regions only for each of an
> user-specified time interval (``regions update interval``).
>
>
> Evaluations
> ===========
>
> A prototype of DAMON has evaluated on an Intel Xeon E7-8837 machine using 20
> benchmarks that picked from SPEC CPU 2006, NAS, Tensorflow Benchmark,
> SPLASH-2X, and PARSEC 3 benchmark suite. Nonethless, this section provides
> only summary of the results. For more detail, please refer to the slides used
> for the introduction of DAMON at the Linux Plumbers Conference 2019[1] or the
> MIDDLEWARE'19 industrial track paper[2].
>
>
> Quality
> -------
>
> We first traced and visualized the data access pattern of each workload. We
> were able to confirm that the visualized results are reasonably accurate by
> manually comparing those with the source code of the workloads.
>
> To see the usefulness of the monitoring, we optimized 9 memory intensive
> workloads among them for memory pressure situations using the DAMON outputs.
> In detail, we identified frequently accessed memory regions in each workload
> based on the DAMON results and protected them with ``mlock()`` system calls.
> The optimized versions consistently show speedup (2.55x in best case, 1.65x in
> average) under memory pressure situation.
>
>
> Overhead
> --------
>
> We also measured the overhead of DAMON. It was not only under the upperbound
> we set, but was much lower (0.6 percent of the bound in best case, 13.288
> percent of the bound in average). This reduction of the overhead is mainly
> resulted from the adaptive regions adjustment. We also compared the overhead
> with that of the straightforward periodic Accessed bit check-based monitoring,
> which checks the access of every page frame. DAMON's overhead was much smaller
> than the straightforward mechanism by 94,242.42x in best case, 3,159.61x in
> average.
>
>
> References
> ==========
>
> Prototypes of DAMON have introduced by an LPC kernel summit track talk[1] and
> two academic papers[2,3]. Please refer to those for more detailed information,
> especially the evaluations.
>
> [1] SeongJae Park, Tracing Data Access Pattern with Bounded Overhead and
> Best-effort Accuracy. In The Linux Kernel Summit, September 2019.
> https://linuxplumbersconf.org/event/4/contributions/548/
> [2] SeongJae Park, Yunjae Lee, Heon Y. Yeom, Profiling Dynamic Data Access
> Patterns with Controlled Overhead and Quality. In 20th ACM/IFIP
> International Middleware Conference Industry, December 2019.
> https://dl.acm.org/doi/10.1145/3366626.3368125
> [3] SeongJae Park, Yunjae Lee, Yunhee Kim, Heon Y. Yeom, Profiling Dynamic Data
> Access Patterns with Bounded Overhead and Accuracy. In IEEE International
> Workshop on Foundations and Applications of Self- Systems (FAS 2019), June
> 2019.
>
>
> SeongJae Park (5):
> mm: Introduce Data Access MONitor (DAMON)
> mm/damon: Add debugfs interface
> mm/damon: Add minimal user-space tools
> Documentation/admin-guide/mm: Add a document for DAMON
> mm/damon: Add kunit tests
>
> .../admin-guide/mm/data_access_monitor.rst | 235 +++
> Documentation/admin-guide/mm/index.rst | 1 +
> mm/Kconfig | 23 +
> mm/Makefile | 1 +
> mm/damon-test.h | 571 ++++++++
> mm/damon.c | 1266 +++++++++++++++++
> tools/damon/bin2txt.py | 64 +
> tools/damon/damn | 36 +
> tools/damon/heats.py | 358 +++++
> tools/damon/nr_regions.py | 116 ++
> tools/damon/record.py | 182 +++
> tools/damon/report.py | 45 +
> tools/damon/wss.py | 121 ++
> 13 files changed, 3019 insertions(+)
> create mode 100644 Documentation/admin-guide/mm/data_access_monitor.rst
> create mode 100644 mm/damon-test.h
> create mode 100644 mm/damon.c
> create mode 100644 tools/damon/bin2txt.py
> create mode 100644 tools/damon/damn
> create mode 100644 tools/damon/heats.py
> create mode 100644 tools/damon/nr_regions.py
> create mode 100644 tools/damon/record.py
> create mode 100644 tools/damon/report.py
> create mode 100644 tools/damon/wss.py
>
> --
> 2.17.1
>