Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
From: SeongJae Park
Date: Tue Mar 19 2024 - 01:21:05 EST
Hi Aravinda,
Thank you for posting this new revision!
I remember telling you, in my reply to the previous revision of this patch[1],
that I didn't see any significant high-level problems, but I have a concern now.
Sorry for not raising this earlier, but let me explain my humble concerns
before it gets even later.
On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad <aravinda.prasad@xxxxxxxxx> wrote:
> DAMON randomly samples one or more pages in every region and tracks
> accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> When the region size is large (e.g., several GBs), which is common
> for large footprint applications, detecting whether the region is
> accessed or not completely depends on whether the pages that are
> actively accessed in the region are picked during random sampling.
> If such pages are not picked for sampling, DAMON fails to identify
> the region as accessed. However, increasing the sampling rate or
> increasing the number of regions increases CPU overheads of kdamond.
DAMON uses sampling because it considers a region as accessed if a portion of
the region that is big enough to be detected via sampling is all accessed. If a
region has some pages that are really accessed but the proportion is too
small to be found via sampling, I think DAMON could say the overall access to
the region is only modest and could even be ignored. In my humble opinion,
this fits the definition of a DAMON region: a memory address range that is
constructed with pages having similar access frequency.
>
> This patch proposes profiling different levels of the application's
> page table tree to detect whether a region is accessed or not. This
> patch set is based on the observation that, when the accessed bit for a
> page is set, the accessed bits at the higher levels of the page table
> tree (PMD/PUD/PGD) corresponding to the path of the page table walk
> are also set. Hence, it is efficient to check the accessed bits at
> the higher levels of the page table tree to detect whether a region
> is accessed or not. For example, if the access bit for a PUD entry
> is set, then one or more pages in the 1GB PUD subtree is accessed as
> each PUD entry covers 1GB mapping. Hence, instead of sampling
> thousands of 4K/2M pages to detect accesses in a large region,
> sampling at the higher level of page table tree is faster and efficient.
For the above reason, I'm concerned that this could bias DAMON's monitoring
results toward reporting more accesses than really happen.
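To illustrate the concern, here is a toy model in Python. It is only a sketch,
not the patch's code: the 512-entry fan-out is the usual x86-64 4KiB layout, and
the function names and the single-touched-page scenario are assumptions made
up for this example. It contrasts what a PUD-level accessed bit reports with
what a single random leaf-page sample would see.

```python
import random

PAGES_PER_PMD = 512   # 4 KiB pages under one PMD entry (x86-64 layout)
PMDS_PER_PUD = 512    # PMD entries under one PUD entry (1 GiB in total)


def pud_accessed(accessed_pages):
    """Upper-level view: the PUD bit is set if any page below it was touched."""
    return len(accessed_pages) > 0


def leaf_sample_hit_rate(accessed_pages, samples=1, trials=10_000, seed=0):
    """Fraction of trials in which random leaf-page sampling sees an access."""
    rng = random.Random(seed)
    total_pages = PAGES_PER_PMD * PMDS_PER_PUD
    hits = sum(
        any(rng.randrange(total_pages) in accessed_pages for _ in range(samples))
        for _ in range(trials)
    )
    return hits / trials


# Hypothetical scenario: one touched page in a 1 GiB PUD subtree.
touched = {42}
print("PUD-level check reports accessed:", pud_accessed(touched))
print("Leaf sampling hit rate:", leaf_sample_hit_rate(touched))
```

In this model the PUD-level check always reports the whole 1 GiB range as
accessed, while leaf sampling almost never does, which is the kind of
difference I mean by "biased to report more".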
>
> This patch set is based on 6.8-rc5 kernel (commit: f48159f8, mm-unstable
> tree)
>
> Changes since v1 [1]
> ====================
>
> - Added support for 5-level page table tree
> - Split the patch to mm infrastructure changes and DAMON enhancements
> - Code changes as per comments on v1
> - Added kerneldoc comments
>
> [1] https://lkml.org/lkml/2023/12/15/272
>
> Evaluation:
>
> - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> and 5TB with 10GB hot data.
> - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> parameters set to default value.
> - DAMON+PTP: Page table profiling applied to DAMON with the above
> parameters.
>
> Profiling efficiency in detecting hot data:
>
> Footprint     1GB     10GB    100GB   5TB
> ---------------------------------------------
> DAMON         >90%    <50%     ~0%     0%
> DAMON+PTP     >90%    >90%    >90%    >90%
The sampling interval is the time interval that is assumed to be large enough
for the workload to make a meaningful amount of accesses within it. Hence, a
meaningful sampling interval depends on the workload's characteristics and the
system's memory bandwidth.
Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB for
the four cases, respectively. And you set the sampling interval to 5ms. Let's
assume the system can access, say, 50 GB per second, and hence it can access
only up to 250 MB per 5ms. So, in the 1GB footprint case, the whole hot
memory region could be accessed while DAMON is waiting for the next sampling
interval. Hence, DAMON would be able to see most accesses via sampling. But
for the 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
region would be accessed between two samplings. DAMON cannot see all the
accesses, and hence the precision could be low.
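The arithmetic above can be reproduced with a short Python sketch. The 50 GB/s
bandwidth is only my guess, the hot-set sizes are the approximate ones I listed,
and the function name is made up for this example. It computes an upper bound
on the fraction of the hot region that can be touched within one sampling
interval, in decimal units.

```python
# Approximate hot-set size per footprint, in MB (decimal units).
HOT_MB = {"1GB": 100, "10GB": 1_000, "100GB": 10_000, "5TB": 10_000}


def touched_fraction(hot_mb, bandwidth_mb_per_s=50_000, interval_ms=5):
    """Upper bound on the fraction of the hot set touched in one interval."""
    reachable_mb = bandwidth_mb_per_s * interval_ms / 1000
    return min(1.0, reachable_mb / hot_mb)


for footprint, hot_mb in HOT_MB.items():
    print(f"{footprint:>5}: {touched_fraction(hot_mb):.1%}")
```

Under these assumptions only the 1GB footprint case can have its whole hot
region touched within a single 5ms interval.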
I don't know the exact memory bandwidth of the system, but to detect the 10 GB
hot region with a 5ms sampling interval, the system should be able to access 2GB
of memory per millisecond, or about 2TB of memory per second. I think systems
with such memory bandwidth are not that common.
I see you also explored a configuration that sets the aggregation interval
higher. But because each sampling checks only the accesses made within the
sampling interval, that might not help in this setup. I'm wondering if you also
explored increasing the sampling interval.
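As a rough sketch of the two options (more bandwidth, or a longer sampling
interval), the Python snippet below computes both. The function names are made
up for this example, the 50 GB/s figure is again only my guess, and decimal
units are used throughout.

```python
def required_bandwidth_gb_per_s(hot_gb, interval_ms):
    """Bandwidth needed to touch the whole hot set within one interval."""
    return hot_gb * 1000 / interval_ms


def required_interval_ms(hot_gb, bandwidth_gb_per_s):
    """Sampling interval needed to touch the whole hot set at a given bandwidth."""
    return hot_gb * 1000 / bandwidth_gb_per_s


# 10 GB hot region with a 5 ms sampling interval: 2000 GB/s, about 2 TB/s.
print(required_bandwidth_gb_per_s(10, 5))

# Same hot region at an assumed 50 GB/s: a 200 ms sampling interval.
print(required_interval_ms(10, 50))
```

Note that 200ms is the same as the aggregation interval used in the
evaluation, so the aggregation interval would likely need to grow together
with the sampling interval.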
Sorry again for not raising this concern early enough. But I think we may need
to discuss this first.
[1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@xxxxxxxxxx
Thanks,
SJ
>
> CPU overheads (in billion cycles) for kdamond:
>
> Footprint     1GB     10GB    100GB   5TB
> ---------------------------------------------
> DAMON         1.15    19.53   3.52    9.55
> DAMON+PTP     0.83    3.20    1.27    2.55
>
> A detailed explanation and evaluation can be found in the arXiv paper:
> https://arxiv.org/pdf/2311.10275.pdf
>
>
> Aravinda Prasad (3):
> mm/damon: mm infrastructure support
> mm/damon: profiling enhancement
> mm/damon: documentation updates
>
> Documentation/mm/damon/design.rst | 42 ++++++
> arch/x86/include/asm/pgtable.h | 20 +++
> arch/x86/mm/pgtable.c | 28 +++-
> include/linux/mmu_notifier.h | 36 +++++
> include/linux/pgtable.h | 79 ++++++++++
> mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> 6 files changed, 424 insertions(+), 14 deletions(-)
>
> --
> 2.21.3