RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

From: Prasad, Aravinda
Date: Wed Mar 20 2024 - 08:31:37 EST




> -----Original Message-----
> From: SeongJae Park <sj@xxxxxxxxxx>
> Sent: Tuesday, March 19, 2024 10:51 AM
> To: Prasad, Aravinda <aravinda.prasad@xxxxxxxxx>
> Cc: damon@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx; sj@xxxxxxxxxx; linux-
> kernel@xxxxxxxxxxxxxxx; s2322819@xxxxxxxx; Kumar, Sandeep4
> <sandeep4.kumar@xxxxxxxxx>; Huang, Ying <ying.huang@xxxxxxxxx>; Hansen,
> Dave <dave.hansen@xxxxxxxxx>; Williams, Dan J <dan.j.williams@xxxxxxxxx>;
> Subramoney, Sreenivas <sreenivas.subramoney@xxxxxxxxx>; Kervinen, Antti
> <antti.kervinen@xxxxxxxxx>; Kanevskiy, Alexander
> <alexander.kanevskiy@xxxxxxxxx>
> Subject: Re: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
>
> Hi Aravinda,
>
>
> Thank you for posting this new revision!
>
> I remember I told you that I don't see a high level significant problems on on the
> reply to the previous revision of this patch[1], but I show a concern now.
> Sorry for not raising this earlier, but let me explain my humble concerns before
> being even more late.

Please find my comments below:

>
> On Mon, 18 Mar 2024 18:58:45 +0530 Aravinda Prasad
> <aravinda.prasad@xxxxxxxxx> wrote:
>
> > DAMON randomly samples one or more pages in every region and tracks
> > accesses to them using the ACCESSED bit in PTE (or PMD for 2MB pages).
> > When the region size is large (e.g., several GBs), which is common for
> > large footprint applications, detecting whether the region is accessed
> > or not completely depends on whether the pages that are actively
> > accessed in the region are picked during random sampling.
> > If such pages are not picked for sampling, DAMON fails to identify the
> > region as accessed. However, increasing the sampling rate or
> > increasing the number of regions increases CPU overheads of kdamond.
>
> DAMON uses sampling because it considers a region as accessed if a portion of
> the region that big enough to be detected via sampling is all accessed. If a region
> is having some pages that really accessed but the proportion is too small to be
> found via sampling, I think DAMON could say the overall access to the region is
> only modest and could even be ignored. In my humble opinion, this fits with the
> definition of DAMON region: A memory address range that constructed with
> pages having similar access frequency.

I agree that DAMON considers a region as accessed if a good portion of the
region is accessed. But there are a few points I would like to discuss:

For large regions (say 10GB, which has 2,621,440 4K pages), sampling at the
PTE level will not cover a good portion of the region. For example, the
default 5ms sampling and 100ms aggregation intervals sample only 20 4K pages
per aggregation interval. Reducing the sampling interval to 1ms and increasing
the aggregation interval to 1 second covers only 1000 4K pages, and results in
higher CPU overheads due to frequent sampling. Even increasing the aggregation
interval to 60 seconds while sampling at 5ms covers only 12000 pages, and
region splitting and merging then happens only once in 60 seconds.
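
As a rough sketch of the arithmetic above (assuming one page is sampled per
region per sampling interval and every sample lands on a distinct page; the
helper name pte_samples is only for illustration):

```python
def pte_samples(sampling_ms, aggregation_ms):
    """Pages sampled in one region during one aggregation interval,
    assuming one page is sampled per sampling interval."""
    return aggregation_ms // sampling_ms

REGION_BYTES = 10 * 1024**3          # 10GB region
PAGE_BYTES = 4 * 1024                # 4K pages
pages = REGION_BYTES // PAGE_BYTES   # 2,621,440 pages in the region

# (sampling interval, aggregation interval) in milliseconds
for sampling_ms, aggregation_ms in [(5, 100), (1, 1000), (5, 60_000)]:
    n = pte_samples(sampling_ms, aggregation_ms)
    print(f"{sampling_ms}ms/{aggregation_ms}ms: {n} samples, "
          f"{100 * n / pages:.4f}% of the region")
```

Even the 60 second case touches well under 1% of the pages in a 10GB region.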

This gets worse when region sizes are, say, 100GB+. We observe that
sampling at the PTE level does not help for large regions, as more samples
are required. Scaling the sampling or aggregation intervals in proportion to
the region size is not practical either, because not all regions are of equal
size: we can have 100GB regions as well as many small regions (e.g., 16KB
to 1MB). So tuning the sampling rate and aggregation interval did not help
for large regions.

Large regions also cannot be avoided. They are created by merging adjacent
smaller regions, or at the beginning of profiling (depending on the min
regions parameter, which defaults to 10). Increasing min regions reduces the
size of regions but increases kdamond overheads, and hence is not preferable.

So, sampling at the PTE level cannot precisely detect accesses to large
regions, resulting in inaccuracies, even though it works for small regions.
From our experiments, we found that with 10% hot data in a large region
(80+GB regions in a 1TB+ footprint application), DAMON was not able to
detect a single access to that region in 95+% of cases with different
sampling and aggregation interval combinations. But DAMON works well for
applications with footprint <50GB, where regions are typically small.

Now consider the scenario with the proposed enhancement. With a
100GB region, if we sample PUD entries that each cover 1GB of address
space, then the default 5ms sampling and 100ms aggregation samples
20 PUD entries, that is, a 20GB portion of the region. This gives a good
estimate of the portion of the region that is accessed. The downside
is that, because the PUD accessed bit is set even if only a small set of
pages under its subtree is accessed, this can report more accesses than
there really are, as you noted.
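
To make the comparison concrete, here is a minimal sketch of the address
space covered by 20 samples at the PTE level versus the PUD level. It assumes
every sample picks a distinct entry; sampled_bytes is a made-up helper, not
anything from the patch:

```python
GB = 1024**3
KB = 1024

def sampled_bytes(samples, entry_bytes, region_bytes):
    """Address space covered by the sampled entries, capped at the region
    size, assuming every sample picks a distinct entry."""
    return min(samples * entry_bytes, region_bytes)

region = 100 * GB
samples = 100 // 5                       # 5ms sampling, 100ms aggregation
pte = sampled_bytes(samples, 4 * KB, region)   # each PTE maps 4KB
pud = sampled_bytes(samples, 1 * GB, region)   # each PUD entry maps 1GB

print(f"PTE: {pte // KB}KB ({100 * pte / region:.5f}% of region)")
print(f"PUD: {pud // GB}GB ({100 * pud / region:.0f}% of region)")
```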

But such large regions are split into two in the next aggregation interval.
As the splitting continues, many smaller regions are formed over the next
few aggregation intervals. Once smaller regions are formed, the proposed
enhancement cannot pick higher levels of the page table tree and behaves as
well as default DAMON. So, with the proposed enhancement, larger regions are
quickly split into smaller regions if they have only a small set of pages
accessed.
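
As a back-of-the-envelope sketch of how quickly this converges, assume purely
for illustration that each aggregation interval halves the region evenly and
that nothing is merged back (the actual split point need not be the midpoint):

```python
def intervals_until_below(region_gb, threshold_gb=1.0):
    """Aggregation intervals until an evenly halved region is smaller
    than threshold_gb (here, the 1GB span of one PUD entry)."""
    intervals = 0
    while region_gb >= threshold_gb:
        region_gb /= 2      # illustrative assumption: even split in two
        intervals += 1
    return intervals

print(intervals_until_below(100))   # 100GB region
```

Under these assumptions a 100GB region drops below the 1GB PUD span after
7 aggregation intervals.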

To avoid misinterpreting the region access count, I feel that the "age" of
the region is of real help and should be considered by both DAMON and the
proposed enhancement. If the age of a region is small (<10), then that
region should not be considered stable and hence should not be
considered for any memory tiering decisions. Regions with age, say, >10
can be considered stable, as they reflect the actual access frequency.
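
A minimal sketch of such an age filter is below. The Region class and
stable_regions() are hypothetical names for illustration, not DAMON
structures, and treating age == 10 as stable is my assumption since the
text above leaves that boundary open:

```python
from dataclasses import dataclass

@dataclass
class Region:               # hypothetical, for illustration only
    start: int
    end: int
    nr_accesses: int
    age: int                # aggregation intervals the region stayed similar

MIN_STABLE_AGE = 10         # example threshold from the discussion above

def stable_regions(regions, min_age=MIN_STABLE_AGE):
    """Keep only regions old enough to be trusted for tiering decisions.
    Regions with age < min_age are treated as not yet stable."""
    return [r for r in regions if r.age >= min_age]

regions = [
    Region(0x0000, 0x4000, nr_accesses=12, age=3),    # freshly split
    Region(0x4000, 0x9000, nr_accesses=15, age=42),   # long-lived
]
for r in stable_regions(regions):
    print(hex(r.start), hex(r.end), r.nr_accesses, r.age)
```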

>
> >
> > This patch proposes profiling different levels of the
> > application's page table tree to detect whether a region is
> > accessed or not. This patch set is based on the observation that, when
> > the accessed bit for a page is set, the accessed bits at the higher
> > levels of the page table tree (PMD/PUD/PGD) corresponding to the path
> > of the page table walk are also set. Hence, it is efficient to check
> > the accessed bits at the higher levels of the page table tree to
> > detect whether a region is accessed or not. For example, if the access
> > bit for a PUD entry is set, then one or more pages in the 1GB PUD
> > subtree is accessed as each PUD entry covers 1GB mapping. Hence,
> > instead of sampling thousands of 4K/2M pages to detect accesses in a
> > large region, sampling at the higher level of page table tree is faster
> > and efficient.
>
> Due to the above reason, I concern this could result in making DAMON monitoring
> results be inaccurately biased to report more than real accesses.

DAMON, even without the proposed enhancement, can be inaccurate for large
regions (see the examples above).

>
> >
> > This patch set is based on 6.8-rc5 kernel (commit: f48159f8,
> > mm-unstable
> > tree)
> >
> > Changes since v1 [1]
> > ====================
> >
> > - Added support for 5-level page table tree
> > - Split the patch to mm infrastructure changes and DAMON enhancements
> > - Code changes as per comments on v1
> > - Added kerneldoc comments
> >
> > [1] https://lkml.org/lkml/2023/12/15/272
> >
> > Evaluation:
> >
> > - MASIM benchmark with 1GB, 10GB, 100GB footprint with 10% hot data
> > and 5TB with 10GB hot data.
> > - DAMON: 5ms sampling, 200ms aggregation interval. Rest all
> > parameters set to default value.
> > - DAMON+PTP: Page table profiling applied to DAMON with the above
> > parameters.
> >
> > Profiling efficiency in detecting hot data:
> >
> > Footprint     1GB     10GB    100GB   5TB
> > ---------------------------------------------
> > DAMON         >90%    <50%    ~0%     0%
> > DAMON+PTP     >90%    >90%    >90%    >90%
>
> Sampling interval is the time interval that assumed to be large enough for the
> workload to make meaningful amount of accesses within the interval. Hence,
> meaningful amount of sampling interval depends on the workload's characteristic
> and system's memory bandwidth.
>
> Here, the size of the hot memory region is about 100MB, 1GB, 10GB, and 10GB
> for the four cases, respectively. And you set the sampling interval as 5ms. Let's
> assume the system can access, say, 50 GB per second, and hence it could be able
> to access only up to 250 MB per 5ms. So, in case of 1GB and footprint, all hot
> memory region would be accessed while DAMON is waiting for next sampling
> interval. Hence, DAMON would be able to see most accesses via sampling. But
> for 100GB footprint case, only 250MB / 10GB = about 2.5% of the hot memory
> region would be accessed between the sampling interval. DAMON cannot see
> whole accesses, and hence the precision could be low.
>
> I don't know exact memory bandwith of the system, but to detect the 10 GB hot
> region with 5ms sampling interval, the system should be able to access 2GB
> memory per millisecond, or about 2TB memory per second. I think systems of
> such memory bandwidth is not that common.
>
> I show you also explored a configuration setting the aggregation interval higher.
> But because each sampling checks only access between the sampling interval,
> that might not help in this setup. I'm wondering if you also explored increasing
> sampling interval.
>

What we have observed is that many real-world benchmarks we experimented
with do not saturate the memory bandwidth. We also experimented with the
masim microbenchmark to understand the impact of the memory access rate
(we inserted a delay between memory access operations in do_rnd_ro() and
other functions). We see a decrease in precision as the access intensity is
reduced. We have experimented with different sampling and aggregation
intervals, but that did not help much in improving precision.

So, I think that in most cases the precision depends more on which page
(hot or cold) is randomly picked for sampling than on the sampling rate. Most
of the time only cold 4K pages are picked in a large region, as they typically
account for 90% of the pages in the region, and hence DAMON does not
detect any accesses at all. Profiling higher levels of the page table tree
can improve this.

> Sorry again for finding this concern not early enough. But I think we may need to
> discuss about this first.

Absolutely no problem. Please let me know your thoughts.

Regards,
Aravinda

>
> [1] https://lkml.kernel.org/r/20231215201159.73845-1-sj@xxxxxxxxxx
>
>
> Thanks,
> SJ
>
>
> >
> > CPU overheads (in billion cycles) for kdamond:
> >
> > Footprint     1GB     10GB    100GB   5TB
> > ---------------------------------------------
> > DAMON         1.15    19.53   3.52    9.55
> > DAMON+PTP     0.83    3.20    1.27    2.55
> >
> > A detailed explanation and evaluation can be found in the arXiv paper:
> > https://arxiv.org/pdf/2311.10275.pdf
> >
> >
> > Aravinda Prasad (3):
> > mm/damon: mm infrastructure support
> > mm/damon: profiling enhancement
> > mm/damon: documentation updates
> >
> > Documentation/mm/damon/design.rst | 42 ++++++
> > arch/x86/include/asm/pgtable.h | 20 +++
> > arch/x86/mm/pgtable.c | 28 +++-
> > include/linux/mmu_notifier.h | 36 +++++
> > include/linux/pgtable.h | 79 ++++++++++
> > mm/damon/vaddr.c | 233 ++++++++++++++++++++++++++++--
> > 6 files changed, 424 insertions(+), 14 deletions(-)
> >
> > --
> > 2.21.3