Re: [PATCH] mm/damon: fix stale TLB young-state handling on arm64

From: SeongJae Park

Date: Tue May 26 2026 - 10:50:54 EST

On Tue, 26 May 2026 08:57:32 +0000 "Kunwu Chan" <kunwu.chan@xxxxxxxxx> wrote:

> May 26, 2026 at 1:46 AM, "SeongJae Park" <sj@xxxxxxxxxx> wrote:
>
>
> >
> > On Mon, 25 May 2026 22:48:46 +0800 Kunwu Chan <kunwu.chan@xxxxxxxxx> wrote:
[...]
> > > Reproduced on arm64 (128 CPUs, 7.1.0-rc4):
> > >
> > > before:
> > > WSS estimation: 50th percentile error 100% (reported as zero)
> > > apply_interval: schemes never tried
> > >
> > > after:
> > > WSS estimation: 50th percentile error 0.08%
> > > apply_interval: passes
> > >
> > And nice test results. I guess you are referring to the tests in damon-tests?
> > Clarifying the context would be nice.
> >
> Yes, those results are from: make -C tools/testing/selftests/damon run_tests
> on the arm64 test machine mentioned above.
>
> The before/after summary was extracted from the relevant failing tests
> (sysfs_update_schemes_tried_regions_wss_estimation.py and
> damos_apply_interval.py) for brevity.

Thank you for clarifying!

wss_estimation increases its working set size up to 160 MiB to cope with this
issue.  It seems your test machine has a large TLB.  I think we should decide
the limit based on the configuration of the system that the test is actually
running on, and apply a similar approach to other tests, including
apply_interval.

For out-of-tree tests, we had better provide a guideline, too.  E.g., run
this sort of test program with this DAMON config to find the reliable test
working set size on your setup.
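
To illustrate what "this sort of test program" could look like, below is a
rough sketch only, not an existing selftest.  The names touch_working_set()
and candidate_sizes_mib() are hypothetical, and the 10 MiB starting size and
the doubling step are assumptions; only the 160 MiB upper bound comes from
wss_estimation as mentioned above.  The DAMON side (starting monitoring and
reading back the estimated working set size) is intentionally left out.

```python
import mmap
import time

PAGE_SIZE = mmap.PAGESIZE


def touch_working_set(size_bytes, duration_sec, page_size=PAGE_SIZE):
    """Write one byte to every page of an anonymous mapping, repeatedly.

    Always completes at least one full pass, then keeps going until
    duration_sec has elapsed.  Returns the number of completed passes.
    """
    buf = mmap.mmap(-1, size_bytes)  # anonymous mapping, not file-backed
    passes = 0
    deadline = time.monotonic() + duration_sec
    try:
        while True:
            for offset in range(0, size_bytes, page_size):
                buf[offset] = (passes + 1) & 0xFF
            passes += 1
            if time.monotonic() >= deadline:
                break
    finally:
        buf.close()
    return passes


def candidate_sizes_mib(start_mib=10, limit_mib=160):
    """Working set sizes to try, doubling from start_mib up to limit_mib.

    start_mib and the doubling step are assumptions for illustration.
    """
    sizes = []
    size = start_mib
    while size <= limit_mib:
        sizes.append(size)
        size *= 2
    return sizes


if __name__ == "__main__":
    for size_mib in candidate_sizes_mib():
        # While this runs, monitor the process with the DAMON config under
        # test and compare the reported working set size with size_mib.
        passes = touch_working_set(size_mib << 20, duration_sec=5)
        print(f"{size_mib} MiB: {passes} passes")
```

The smallest size at which the DAMON estimate reliably matches the real
working set would then be the recommended test working set size for that
setup.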

>
> > Also, have you had a chance to measure the performance impact?
> We haven't done detailed performance measurements yet, but we can try to
> collect some numbers for the flush overhead on a few different setups.
>
> > So, I'd like to have this change. But, unless we have very clear evidence
> > showing this change is not increasing the performance overhead, I'd prefer
> > making this as an optional feature.
> >
> We agree that making it optional sounds safer unless we have solid
> evidence showing the overhead is negligible. Keeping the current
> default behavior for production workloads also makes sense to me.
>
> > For the user interface, we could add a new sysfs file for the option, say,
> > 'flush_sample_tlb' under 'monitoring_attrs' directory.
> >
> The proposed 'flush_sample_tlb' interface under monitoring_attrs sounds
> reasonable to me as well.

I was thinking about this again.  I still want DAMON to be easy to test.  But,
is this really making tests that difficult?  Users could increase the test
working set size.  I'm not sure the difficulty is large enough to justify
adding a new optional feature.  Meanwhile, adding an optional feature only for
tests might confuse users.  DAMON usage might also diverge, which would add
maintenance burden.

So, now I think another option is improving the documentation.  It should
clearly explain how and why DAMON does not flush the TLB, what problems are
expected (in tests), and what we recommend.  With this option, we should also
update the existing DAMON tests to be reliable and aligned with the documented
recommendation.  If we find it is still a problem for testing even after
applying the recommendation, or in production, we can revisit.

Regardless of the decision about the optional feature in DAMON, I think such
documentation and test improvements should be made.

Maybe I'm biased, so any input would be appreciated.  What do you think, Kunwu
and Lian?

Thanks,
SJ

[...]