RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON

From: Prasad, Aravinda
Date: Mon Mar 25 2024 - 09:36:30 EST




> -----Original Message-----
> From: SeongJae Park <sj@xxxxxxxxxx>
> Sent: Saturday, March 23, 2024 12:03 AM
> To: Prasad, Aravinda <aravinda.prasad@xxxxxxxxx>
> Cc: SeongJae Park <sj@xxxxxxxxxx>; damon@xxxxxxxxxxxxxxx; linux-
> mm@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; s2322819@xxxxxxxx; Kumar,
> Sandeep4 <sandeep4.kumar@xxxxxxxxx>; Huang, Ying <ying.huang@xxxxxxxxx>;
> Hansen, Dave <dave.hansen@xxxxxxxxx>; Williams, Dan J
> <dan.j.williams@xxxxxxxxx>; Subramoney, Sreenivas
> <sreenivas.subramoney@xxxxxxxxx>; Kervinen, Antti <antti.kervinen@xxxxxxxxx>;
> Kanevskiy, Alexander <alexander.kanevskiy@xxxxxxxxx>
> Subject: RE: [PATCH v2 0/3] mm/damon: Profiling enhancements for DAMON
>
> On Fri, 22 Mar 2024 12:12:09 +0000 "Prasad, Aravinda"
> <aravinda.prasad@xxxxxxxxx> wrote:
>
> [...]
> > > > For large regions (say 10GB, that has 2,621,440 4K pages),
> > > > sampling at PTE level will not cover a good portion of the region.
> > > > For example, default 5ms sampling and 100ms aggregation samples
> > > > only 20 4K pages in an aggregation interval.
> > >
> > > If the 20 attempts all failed at finding any single accessed 4K
> > > page, I think it roughly means less than 5% of the region is
> > > accessed within the user-specified time (aggregation interval). I
> > > would translate that as only tiny portion of the
>
> I now find the above sentence is not correct. Sorry, my bad. Let me re-write.
>
> I think it roughly means the workload is not accessing the region at a frequency
> high enough for DAMON to observe within the user-specified time (sampling
> interval).
>
> > > region is accessed within the user-specified time, and hence DAMON
> > > is ok to say the region is nearly not accessed.
> >
> > I am looking at it from the other way:
> >
> > To detect if a region is hot or cold at least 1% of the pages in the
> > region should be sampled. For a 10GB region (with 2,621,440 4K pages)
> > this requires sampling at least 26,214 pages. For a 100GB region this
> > will require sampling at least 262,144 pages.
>
> Why do you think 1% of the pages should be sampled?

1% is just an example.
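
To make the arithmetic behind that example easy to re-check, here is a rough
sketch in Python. The helper names are made up for illustration and are not
DAMON code; it only assumes 4K pages and one sampled page per region in every
sampling interval, as discussed in this thread.

```python
PAGE_SIZE = 4096  # 4K pages, as in the discussion above

def pages_in(region_bytes):
    """Number of 4K pages in a region of the given size."""
    return region_bytes // PAGE_SIZE

def samples_per_aggregation(sampling_ms, aggregation_ms):
    """One page is sampled per region in every sampling interval."""
    return aggregation_ms // sampling_ms

def seconds_to_sample(n_samples, sampling_ms):
    """Time needed to take n_samples samples, one per sampling interval."""
    return n_samples * sampling_ms / 1000

GIB = 2**30
for size_gib in (10, 100):
    # 1% is only the example coverage target used in this thread.
    one_percent = pages_in(size_gib * GIB) // 100
    print(size_gib, "GB:", one_percent, "pages,",
          seconds_to_sample(one_percent, 5), "seconds at 5ms")
```

Any other coverage target can be plugged in instead of 1%; the time grows
linearly with the region size as long as the sampling interval is fixed.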

>
> DAMON defines the region as an address range that contains pages having similar
> access frequency. Hence if we see a page was accessed within a given time
> interval, we can assume all pages in the region were also accessed within the interval,
> and vice versa. That's why we sample only a single page per region, and how
> DAMON's monitoring overhead can be controlled regardless of the size of the
> monitoring target memory.

Initially, when DAMON creates the "min" regions, it does not consider access frequency.
They are created by dividing the address space. So, at the beginning, these regions
need not contain pages with similar access frequency. Only eventually, as regions are
split and merged, do regions form that have similar access frequency.

We observe that hot sets are spread across the address space of the application
and that, often, only a portion of a DAMON-created region contains hot data
as per the application's access pattern. In such cases a single sample per
region is not enough.
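
A back-of-the-envelope way to see this, sketched in Python. It assumes (my
assumptions, not something DAMON guarantees) that samples are independent and
uniformly random over the region, and that a sampled hot page is always seen
as accessed. The function name is hypothetical.

```python
def miss_probability(hot_fraction, n_samples):
    """Probability that none of n_samples uniform random samples lands on a
    hot page, when hot_fraction of the region's pages are hot."""
    return (1.0 - hot_fraction) ** n_samples

# 20 samples is what 5ms sampling with 100ms aggregation gives per region.
for hot_fraction in (0.05, 0.01, 0.001):
    print(hot_fraction, miss_probability(hot_fraction, 20))
```

With only 20 samples per aggregation interval, a region whose hot portion is
a small fraction of its size is reported as not accessed most of the time.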

For small memory footprint applications with small region size, I agree there are
absolutely no issues (also confirmed by our experiments). But for large footprint
applications (1TB+) that can have large regions (50GB+) we see these issues.

>
> To detect if the region is hot or cold, DAMON continues sampling multiple times
> and uses the number of sampling intervals that saw an access to the region
> (nr_accesses) as the relative hotness of the region. Hence, how many samples are
> required depends on what precision of the relative hotness the user wants.
> The size of the region doesn't matter here.
>
> Am I missing something?

As mentioned before, all of this works fine for small footprint applications (<100GB).
But as we go beyond a 1TB footprint we start seeing issues. I can show you a demo
on 1TB+ footprint applications.

>
> >
> > If we sample at 5ms, this takes 131.072 seconds to cover 1% of 10GB
> > and 1310.72 seconds to cover 100GB.
> >
> > DAMON shows the selected page as accessed if that page was
> > accessed during the 5ms sampling window. Now if we increase the
> > sampling interval to 20ms to improve access detection, then covering
> > 1% of the region takes even longer.
> >
> > >
> > > > Increasing sampling to 1 ms and aggregation to 1 second can only
> > > > cover 1000 4K pages, but results in higher CPU overheads due to
> > > > frequent sampling.
> > > > Even increasing the aggregation interval to 60 seconds but
> > > > sampling at 5ms can only cover 12000 samples, but region splitting
> > > > and merging happens once in 60 seconds.
> > >
> > > At the beginning of each sampling interval, DAMON randomly picks one
> > > page per region, clears their accessed bits, waits until the sampling
> > > interval is finished, and checks the accessed bits again. In other
> > > words, DAMON shows only accesses made in the last sampling interval.
> >
> > Yes, I see this in the code:
> >
> > while (time < aggregation_interval)
> > {
> >     clear_access_bit
> >     sleep(sampling_time)
> >     check_access_bit
> > }
> >
> > I would suggest this logic instead.
> >
> > while (time < aggregation_interval)
> > {
> >     number_of_samples = aggregation_interval / sampling_time;
> >
> >     /* clear the access bits of all sample pages up front */
> >     for (i = 0; i < number_of_samples; i++)
> >     {
> >         clear_access_bit
> >     }
> >
> >     sleep(aggregation_interval)
> >
> >     /* check all sample pages after the whole interval */
> >     for (i = 0; i < number_of_samples; i++)
> >     {
> >         check_access_bit
> >     }
> > }
> >
> > This can help in better access detection. I am sure you would
> > have already explored it.
>
> The way to detect the access in the region is implemented by each monitoring
> operations set (vaddr, fvaddr, and paddr). We could implement yet another
> monitoring operations set with a new access detection method. Nonetheless, I
> think changing the existing monitoring operations sets to use this suggestion
> while keeping their concepts would not be easy.

Agree.
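
For reference, here is a toy model in Python of the two loops discussed above.
It is not kernel code and all names are hypothetical. It assumes a workload
that walks the region sequentially, touching each page exactly once per
aggregation interval, which is the case I am worried about.

```python
import random

def per_interval_hits(n_pages, n_samples, rng):
    """Current scheme: one random page per sampling interval. An access is
    seen only if it falls inside the interval in which the page is sampled."""
    chunk = n_pages // n_samples  # pages the workload touches per interval
    hits = 0
    for k in range(n_samples):
        page = rng.randrange(n_pages)
        # In interval k the sequential workload touches [k*chunk, (k+1)*chunk).
        if k * chunk <= page < (k + 1) * chunk:
            hits += 1
    return hits

def batched_hits(n_pages, n_samples, rng):
    """Suggested scheme: clear the bits of n_samples random pages up front,
    sleep for the whole aggregation interval, then check all of them."""
    chunk = n_pages // n_samples
    touched = chunk * n_samples  # pages touched over the whole interval
    pages = [rng.randrange(n_pages) for _ in range(n_samples)]
    return sum(1 for p in pages if p < touched)

rng = random.Random(0)
print("per-interval hits:", per_interval_hits(2000, 20, rng))
print("batched hits:     ", batched_hits(2000, 20, rng))
```

In this model the batched check sees every sampled page as accessed, while the
per-interval check sees a page only when the sample happens to coincide with
the sweep. Note that the batched check only tells whether a page was touched
at least once in the aggregation interval, so it is a different signal from
the per-interval count.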

>
> >
> > >
> > > Increasing the number of samples per aggregation interval can help
> > > DAMON know the access frequency of regions at finer granularity, but
> > > doesn't allow DAMON to see more accesses. Rather, if the aggregation
> > > interval is fixed (reducing the sampling interval), DAMON can show
> > > even fewer accesses.
> > >
> > > What we need here is to give the workload a longer sampling time so
> > > that it can access an amount of memory large enough to be found by
> > > DAMON.
> >
> > But even with a longer sampling time, we may miss the access. For
> > example, consider that all the pages in the region are accessed
> > sequentially. If DAMON samples a page other than the one being
> > accessed, it will miss the access. So even with a longer sampling
> > time, it is possible that none of the accesses are detected.
>
> If there were accesses to some pages of the region but an unaccessed page was
> picked as the sampling target, one could say only a tiny portion of the region
> is accessed, so we can treat the region as not accessed at all. That's at least what
> the monitoring operations set you use here ('vaddr') thinks.
>
> [...]
> > > Also, if we can allow large enough age, the random region split will
> > > eventually find the small hot regions even without high level
> > > accessed bit hint. Of course the hint could help finding it
> > > earlier. I think that was one of my comment on the first version of this patch.
> >
> > The problem is that a large region that is split is immediately merged
> > as the split regions have access count zero.
> >
> > We observe that large regions are never getting split at all due to this.
>
> I understand this is a valid concern. Especially because we currently split each
> region into two sub-regions, finding a small hot memory region in the middle of a
> huge region could be challenging. This concern was raised by Jonathan Cameron
> before DAMON was merged into the mainline. There was also research from
> my previous colleague saying that increasing the number of sub-regions for each
> split improves the accuracy. Nonetheless, it increases the overall number of regions
> and hence the overhead. And we haven't seen a real issue due to this in
> production so far, so we are still keeping the old behavior. I'm thinking about a way to
> make this better.

These issues are observed only when the memory footprint is large enough (1TB+).
Production systems may not be running such large footprint applications yet.
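
To show why the split of a large, mostly cold region does not survive, here
is a simplified Python model of the merge step as I understand it. The
function and parameter names are illustrative and this is not DAMON's actual
code: adjacent regions whose access counts differ by no more than a threshold
are merged, with a size-weighted access count.

```python
def merge_adjacent(regions, threshold):
    """regions: non-empty list of (start, end, nr_accesses), sorted by start.
    Merge neighbours whose nr_accesses differ by at most threshold."""
    merged = [regions[0]]
    for start, end, acc in regions[1:]:
        p_start, p_end, p_acc = merged[-1]
        if p_end == start and abs(p_acc - acc) <= threshold:
            # Size-weighted average of the two access counts.
            size_p, size_c = p_end - p_start, end - start
            avg = (p_acc * size_p + acc * size_c) // (size_p + size_c)
            merged[-1] = (p_start, end, avg)
        else:
            merged.append((start, end, acc))
    return merged

# A 50GB region split in two; neither half had its sampled page accessed.
GB = 2**30
print(merge_adjacent([(0, 25 * GB, 0), (25 * GB, 50 * GB, 0)], threshold=0))
```

Both halves have an access count of zero, so they are merged back into the
original region in the same aggregation interval, and the small hot portion
inside stays hidden.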

>
> That said, a real system would have more than a single region, and the access
> pattern would be more dynamic. That will cause regions to be merged and split in
> a more random and chaotic way. Hence I think there is still a chance to find the
> small hot portion eventually. Also, the sampling targets are picked randomly. A
> page of the small hot portion will eventually be picked as the sampling target, even
> multiple times, and at least reset the 'age' of the region.
>
> I sometimes turn on DAMON to monitor the entire physical address space (about 128
> GiB) of my machine and run no active workload but just a few background
> daemons. So the system would have only a small amount of accesses. At the
> beginning, the monitoring output shows all regions as not accessed (nr_accesses
> 0) and having the same 'age'. But as time goes by, the regions still show no
> access (nr_accesses 0), but different ages and sizes.

I have not tried physical address space monitoring. But for "vaddr", we see DAMON
working well up to a 100GB footprint.

>
> Again, I'm not saying the existing monitoring mechanism is perfect and optimal. We
> should continue optimizing it. Nonetheless, the current accuracy has not been
> proved to be too awful to be used in the real world. I know at least a few unnamed
> production usages of DAMON, and they haven't complained about DAMON's
> accuracy so far.

We see this problem very consistently on large footprint applications, so DAMON could
well be working fine for the smaller footprint applications in production.

Regards,
Aravinda


>
>
> Thanks,
> SJ
>
> >
> > Regards,
> > Aravinda
> [...]