Re: [RFC PATCH 0/6] mm/damon: Add mTHP-aware collapse/split with ARM SPE feedback

From: Wang Lian

Date: Thu Jun 18 2026 - 23:40:56 EST

Hi SeongJae,

Thank you for the thorough and thoughtful review. Your feedback on the
x86 AF behavior was an important correction -- I'll address that and
your other questions below.

On Thu, 18 Jun 2026 SeongJae Park <sj@xxxxxxxxxx> wrote:

> This makes sense to me. I also agree this could have caused the reported
> problem. And this is a known limitation of DAMON. My suggestion for
> straightforward workaround of this problem is, using 'age' information
> of DAMON for better identification of the hot memory.

Thank you for pointing out idle time percentiles [1]. We agree that 'age'
helps differentiate frequently-accessed from occasionally-accessed regions,
and it is a good workaround for many cases.

However, age operates at region granularity, which is still at or above
PMD level for THP-mapped memory. When only a few 4KB subpages within a
2MB THP are hot, age tells us the region has been accessed recently, but
not which subpages are hot. The split decision needs sub-PMD information,
which is what the SPE heatmap provides.

That said, combining age with split could be valuable: split only regions
that have been consistently hot (high age) AND have sparse sub-page access
patterns. We will explore this.
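
As a rough illustration of that combined check, here is a minimal sketch in
Python. The thresholds (`MIN_AGE`, `MAX_HOT_FRACTION`) and the `subpage_hits`
input are hypothetical names chosen for this example; they are not part of
DAMON or of this series.

```python
# Hypothetical thresholds, chosen only for illustration.
MIN_AGE = 10             # aggregation intervals the region kept its pattern
MAX_HOT_FRACTION = 0.25  # at most 25% of subpages hot => "sparse"


def should_split(age, subpage_hits):
    """Split only if the region is consistently hot AND sparsely accessed.

    age: number of aggregation intervals the region kept its access pattern.
    subpage_hits: per-subpage sample counts within one THP (e.g. from SPE).
    """
    if age < MIN_AGE or not subpage_hits:
        return False
    hot = sum(1 for h in subpage_hits if h > 0)
    if hot == 0:
        return False  # nothing observed, so there is no evidence to split on
    return hot / len(subpage_hits) <= MAX_HOT_FRACTION
```

With these assumed values, a 512-subpage THP with 4 hot subpages and age 20
would be split, while the same heatmap at age 2 would be left alone.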

> > On ARM64, this is compounded by the hardware AF mechanism -- the AF
> > is only set on a TLB miss.
>
> This makes sense to me. However, I don't get how this is contributing
> to the problem. Could you please elaborate?

The AF-on-TLB-miss behavior creates a second-order problem that directly
exacerbates the overestimation.

When DAMON's mkold path clears the PMD AF, it deliberately skips the TLB
flush to minimize overhead. If the dense working set fits entirely within
the L2 TLB (e.g., 16MB workload using 8 PMD entries on Kunpeng 920's 2048-entry
L2 TLB), subsequent hardware accesses hit the valid, stale TLB entries
directly. The hardware MMU never generates a page table walk, so the
in-memory PMD AF stays 0.

Consequently, DAMON sees `nr_accesses = 0` and assumes the region is completely
cold, so it cannot track shifts in sub-page usage. When sporadic/noise accesses
later hit other parts of this "seemingly cold" PMD and trigger an isolated TLB
refill, DAMON abruptly sees the whole 2MB as hot. This binary oscillation
(completely blind vs. fully hot) is what drives the massive overestimation
under THP.
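
The mechanism can be shown with a toy model. This is only a sketch of the
behavior described above, not kernel code: `ToyPmd`, `sample()` and the
one-entry "TLB" are hypothetical simplifications.

```python
class ToyPmd:
    """One PMD entry plus a flag saying whether its translation is cached."""

    def __init__(self):
        self.af = False      # in-memory Access Flag
        self.in_tlb = False  # True while a translation sits in the TLB

    def access(self):
        # Hardware sets AF only when it has to walk the page table.
        if not self.in_tlb:
            self.af = True
            self.in_tlb = True

    def mkold(self, flush=False):
        # Clear AF; without a flush the cached translation survives.
        self.af = False
        if flush:
            self.in_tlb = False

    def evict(self):
        # Natural eviction caused by TLB pressure.
        self.in_tlb = False


def sample(pmd, flush=False, evict=False):
    """One sampling interval: mkold, workload access, then read AF."""
    pmd.mkold(flush=flush)
    if evict:
        pmd.evict()
    pmd.access()
    return pmd.af
```

In this model, a PMD that was accessed once and then sampled without a flush
or eviction reads as not accessed, even though the workload touched it. With
an eviction in between, the same access is observed.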

We confirmed this TLB-reach aspect empirically via our T1 test:

  - 16MB THP (8 PMDs, 0.4% of L2 TLB reach) -> DAMON tracks 0 accesses (blind)
  - 16GB THP (8192 PMDs, 400% of L2 TLB reach) -> DAMON tracks normally due to
    natural eviction
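
The percentages above follow from simple arithmetic. The sketch below
reproduces them, assuming each 2MB THP occupies exactly one L2 TLB entry and
using the 2048-entry figure quoted earlier; `tlb_pressure()` is a hypothetical
helper name.

```python
PMD_SIZE = 2 * 1024**2  # 2MB THP mapped by one PMD entry
L2_TLB_ENTRIES = 2048   # Kunpeng 920 figure quoted above


def tlb_pressure(wss_bytes):
    """Return (PMD entries needed, fraction of L2 TLB entries used)."""
    pmds = wss_bytes // PMD_SIZE
    return pmds, pmds / L2_TLB_ENTRIES


for label, wss in (("16MB", 16 * 1024**2), ("16GB", 16 * 1024**3)):
    pmds, frac = tlb_pressure(wss)
    print(f"{label}: {pmds} PMDs, {frac:.1%} of L2 TLB entries")
```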

> > x86 is not subject to this specific blindness under similar
> > conditions.
>
> To my understanding on x86, same issue exists. If TLB hits, Accessed
> bit is not set, and DAMON shows it as unaccessed. Am I missing
> something?

You are entirely right, and I was wrong on this point. I re-checked the
kernel source and verified that x86's ptep_test_and_clear_young() does NOT
flush the TLB. Even ptep_clear_flush_young() on x86 deliberately skips the
flush as a performance optimization (arch/x86/mm/pgtable.c:486-502). The
same optimization exists on PowerPC and RISC-V.

Therefore, ARM64 and x86 are both theoretically vulnerable to this stale-TLB
blind spot under identical tightly-fit workloads. Our initial assumption
was biased because T1 was only conducted on ARM64. We will reproduce the
T1 setup on x86 to verify the exact behavior, and I will correct this
claim in the v2 cover letter. Thank you for catching this mistake.

> Nice! Asier was planning to do similar work in future. I think you
> could collaborate to reduce unnecessary duplicates!

Great to hear! We would be happy to collaborate with Asier. I'll reach
out to him to coordinate our efforts.

> I'd suggest making the name simpler and consistent to DAMOS_COLLAPSE,
> though. Say, DAMOS_SPLIT ?

Agreed. DAMOS_SPLIT is cleaner and fits the existing naming convention
perfectly. Will rename in v2.

> So you implemented a debugfs interface? That must be a nice approach
> for PoC. But it may be difficult to be upstreamed as is.
>
> You could build a control plane that decides the exact address ranges
> to split, and directly feed it to DAMOS using DAMOS address filter.

The native perf event approach [3] aligns perfectly with our long-term
Phase 2c plan, and we are highly interested in collaborating on it to
eliminate the userspace daemon and debugfs bridge entirely.

However, since native kernel-side SPE handling is a long-term item, we
will follow your pragmatic alternative suggestion for v2: use DAMOS address
filters or user_input quota goals [2] to feed the split decisions from
userspace cleanly. This allows us to upstream the core infrastructure
(mTHP target_order for collapse and the new DAMOS_SPLIT action) first.

> Do you really need to run khugepaged together, when you already have
> DAMOS_COLLAPSE, and anyway you are running DAMON for hugepage splits?

Excellent point. Running both concurrently on the same VMA introduces
redundancy and heavy ping-pong effects.

Option (b) is definitely cleaner: we will let DAMON handle both split and
re-collapse decisions using its own access data. To make this robust in
production environments where khugepaged is globally enabled, we will
explore having the DAMOS_SPLIT path temporarily mark the target ranges
(e.g., via a pseudo-VM_NOHUGEPAGE back-off mechanism) to prevent
khugepaged from immediately undoing DAMON's work.

> > - The rbtree entry TTL (30s) and signal threshold (1/10 of peak) are
> > empirical defaults subject to further tuning.
>
> I don't fully understand this part. Could you please elaborate?

Since ARM SPE statistically samples individual operations rather than recording
every access, the raw data is inherently noisy.

The TTL (30s) defines the lifecycle of our per-folio rbtree tracking entries.
Entries not updated within 30 seconds are pruned to prevent stale tracking data
from corrupting split decisions after a workload phase change. 30s is selected
to comfortably outlive DAMON's aggregation intervals while keeping the rbtree
memory footprint tightly bounded.
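
The pruning rule can be sketched as follows. This is a userspace illustration
only: a plain dict stands in for the rbtree, and the key (a PFN-like integer)
and the function name `prune_stale()` are hypothetical.

```python
TTL_SECONDS = 30.0  # the empirical default discussed above


def prune_stale(entries, now, ttl=TTL_SECONDS):
    """Drop tracking entries not updated within `ttl` seconds.

    entries: dict mapping a folio key (hypothetical: a PFN-like integer) to
             the timestamp of its last update, standing in for the rbtree.
    Returns the number of entries removed.
    """
    stale = [key for key, last in entries.items() if now - last > ttl]
    for key in stale:
        del entries[key]
    return len(stale)
```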

The signal threshold (1/10 of peak) filters out the statistical sampling noise.
Instead of treating any subpage with access > 0 as hot, the algorithm finds the
peak access count inside the 2MB region and only marks sub-chunks with >= 1/10
of that peak as genuinely hot. On Kunpeng 920, this specific threshold successfully
reduced false-hot subpage classifications from ~50% to <5%. We plan to make
these parameters sysfs-configurable.
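
A minimal sketch of the threshold filter, for illustration only. The function
name `hot_chunks()` and the `divisor` parameter are hypothetical; the actual
patch operates on its own per-folio data structures.

```python
def hot_chunks(counts, divisor=10):
    """Return indices of sub-chunks with count >= peak / divisor.

    counts: per-sub-chunk sample counts within one 2MB region.
    """
    peak = max(counts, default=0)
    if peak == 0:
        return []  # no samples at all, so nothing is hot
    # Integer-only comparison: count * divisor >= peak.
    return [i for i, c in enumerate(counts) if c * divisor >= peak]
```

For counts of [100, 9, 10, 0, 1, 55], a naive "count > 0" test marks five
sub-chunks as hot, whereas the 1/10-of-peak rule keeps only indices 0, 2
and 5.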

> > - The ARM64 DAMON blind spot (WSS < L2 TLB reach) is a pre-existing
> > hardware-MMU characteristic, not introduced by this series. Setting
> > nr_accesses/min=0 serves as an effective workaround for the split path.
>
> I don't fully understand this, too. Could you please elaborate and
> enlighten me?

The blind spot creates an operational deadlock for the split infrastructure:
1. WSS < TLB reach -> All THP entries stay cached in TLB.
2. DAMON's page-table scan yields `nr_accesses = 0` globally.
3. A scheme requiring `nr_accesses.min = 1` never fires -> DAMOS_SPLIT is never invoked.
4. THPs remain unsplit -> WSS remains within TLB reach -> Loop returns to step 1.

Setting `nr_accesses.min = 0` and `max = 0` breaks this deadlock. It makes the
scheme match these seemingly "dead/cold" regions. Once the split handler is
invoked, it checks the ARM SPE telemetry (which captures data directly from the
instruction pipeline, completely bypassing the MMU page-table AF limitation).
If SPE reveals a sparse access heatmap, the split is executed. Once the THP is
split into mTHP/base pages, the TLB reach drops, natural TLB misses resume, and
DAMON's standard page-table tracking fully recovers.
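
The control flow can be summarized in a short sketch. This is a simplified
model under stated assumptions, not the patch itself: `split_decision()`, the
`spe_counts` input and the `max_hot_fraction` cutoff are hypothetical.

```python
def scheme_matches(nr_accesses, min_acc, max_acc):
    """DAMOS-style access-pattern check: min <= nr_accesses <= max."""
    return min_acc <= nr_accesses <= max_acc


def split_decision(nr_accesses, min_acc, max_acc, spe_counts,
                   max_hot_fraction=0.25):
    """Return True if the (hypothetical) split handler would split.

    spe_counts: per-subpage SPE sample counts for the THP.
    max_hot_fraction: hypothetical sparsity cutoff.
    """
    if not scheme_matches(nr_accesses, min_acc, max_acc):
        return False  # scheme never fires, so the handler is never reached
    hot = sum(1 for c in spe_counts if c > 0)
    # Split only when SPE shows some, but sparse, sub-page activity.
    return 0 < hot <= max_hot_fraction * len(spe_counts)
```

With a blind region (`nr_accesses = 0`) and 8 of 512 subpages sampled by SPE,
a scheme with `min = 1` never reaches the handler, while `min = 0, max = 0`
does and results in a split.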

Thanks again for your guidance. The action items for v2 are locked in:
1. Rename DAMOS_MTHP_SPLIT -> DAMOS_SPLIT.
2. Drop debugfs in favor of DAMOS address filters / control plane.
3. Correct x86 AF behavior statements in the cover letter.
4. Coordinate with Asier on split/collapse unification.
5. Implement back-off to prevent khugepaged ping-pong under Option (b).

Best regards,
Wang Lian